Sharing Economy and Big Data Analytics

Series Editor
Jean-Charles Pomerol

Soraya Sedkaoui
Mounia Khelfaoui
First published 2020 in Great Britain and the United States by ISTE Ltd and John Wiley & Sons, Inc.
Apart from any fair dealing for the purposes of research or private study, or criticism or review, as permitted under the Copyright, Designs and Patents Act 1988, this publication may only be reproduced, stored or transmitted, in any form or by any means, with the prior permission in writing of the publishers, or in the case of reprographic reproduction in accordance with the terms and licenses issued by the CLA. Enquiries concerning reproduction outside these terms should be sent to the publishers at the undermentioned address:

ISTE Ltd
27-37 St George's Road
London SW19 4EU
UK

John Wiley & Sons, Inc.
111 River Street
Hoboken, NJ 07030
USA
www.iste.co.uk
www.wiley.com
© ISTE Ltd 2020

The rights of Soraya Sedkaoui and Mounia Khelfaoui to be identified as the authors of this work have been asserted by them in accordance with the Copyright, Designs and Patents Act 1988.

Library of Congress Control Number: 2019951392

British Library Cataloguing-in-Publication Data
A CIP record for this book is available from the British Library
ISBN 978-1-78630-506-0
Contents

Preface

Introduction

Part 1. The Sharing Economy or the Emergence of a New Business Model

Chapter 1. The Sharing Economy: A Concept Under Construction
1.1. Introduction
1.2. From simple sharing to the sharing economy
1.2.1. The genesis of the sharing economy and the break with "consumer" society
1.2.2. The sharing economy: which economy?
1.3. The foundations of the sharing economy
1.3.1. Peer-to-peer (P2P): a revolution in computer networks
1.3.2. The gift: the abstract aspect of the sharing economy
1.3.3. The service economy and the offer of use
1.4. Conclusion

Chapter 2. An Opportunity for the Business World
2.1. Introduction
2.2. Prosumption: a new sharing economy trend for the consumer
2.3. Poverty: a target in the spotlight of the shared economy
2.4. Controversies on economic opportunities of the sharing economy
2.5. Conclusion

Chapter 3. Risks and Issues of the Sharing Economy
3.1. Introduction
3.2. Uberization: a white grain or just a summer breeze?
3.3. The sharing economy: a disruptive model
3.4. Major issues of the sharing economy
3.5. Conclusion

Chapter 4. Digital Platforms and the Sharing Mechanism
4.1. Introduction
4.2. Digital platforms: "What growth!"
4.3. Digital platforms or technology at the service of the economy
4.4. From the sharing economy to the sharing platform economy
4.5. Conclusion

Part 2. Big Data Analytics at the Service of the Sharing Economy

Chapter 5. Beyond the Word "Big": The Changes
5.1. Introduction
5.2. The 3 Vs and much more: volume, variety, velocity
5.2.1. Volume
5.2.2. Variety
5.2.3. Velocity
5.2.4. What else?
5.3. The growth of computing and storage capacities
5.3.1. Big Data versus Big Computing
5.3.2. Big Data storage
5.3.3. Updating Moore's Law
5.4. Business context change in the era of Big Data
5.4.1. The decision-making process and the dynamics of value creation
5.4.2. The emergence of new data-driven business models
5.5. Conclusion

Chapter 6. The Art of Analytics
6.1. Introduction
6.2. From simple analysis to Big Data analytics
6.2.1. Descriptive analysis: learning from past behavior to influence future outcomes
6.2.2. Predictive analysis: analyzing data to predict future outcomes
6.2.3. Prescriptive analysis: recommending one or more action plan(s)
6.2.4. From descriptive analysis to prescriptive analysis: an example
6.3. The process of Big Data analytics: from the data source to its analysis
6.3.1. Definition of objectives and requirements
6.3.2. Data collection
6.3.3. Data preparation
6.3.4. Exploration and interpretation
6.3.5. Modeling
6.3.6. Deployment
6.4. Conclusion

Chapter 7. Data and Platforms in the Sharing Context
7.1. Introduction
7.2. Pioneers in Big Data
7.2.1. Big Data on Walmart's shelves
7.2.2. The Big Data behind Netflix's success story
7.2.3. The Amazon version of Big Data
7.2.4. Big Data and social networks: the case of Facebook
7.2.5. IBM and data analysis in the health sector
7.3. Data, essential for sharing
7.3.1. Data and platforms at the heart of the sharing economy
7.3.2. The data of sharing economy companies
7.3.3. Privacy and data security in a sharing economy
7.3.4. Open Data and platform data sharing
7.4. Conclusion

Chapter 8. Big Data Analytics Applied to the Sharing Economy
8.1. Introduction
8.2. Big Data and Machine Learning algorithms serving the sharing economy
8.2.1. Machine Learning algorithms
8.2.2. Algorithmic applications in the sharing economy context
8.3. Big Data technologies: the sharing economy companies' toolbox
8.3.1. The appearance of a new concept and the creation of new technologies
8.4. Big Data on the agenda of sharing economy companies
8.4.1. Uber
8.4.2. Airbnb
8.4.3. BlaBlaCar
8.4.4. Lyft
8.4.5. Yelp
8.4.6. Other cases
8.5. Conclusion

Part 3. The Sharing Economy? Not Without Big Data Algorithms

Chapter 9. Linear Regression
9.1. Introduction
9.2. Linear regression: an advanced analysis algorithm
9.2.1. How are regression problems identified?
9.2.2. The linear regression model
9.2.3. Minimizing modeling error
9.3. Other regression methods
9.3.1. Logistic regression
9.3.2. Additional regression models: regularized regression
9.4. Building your first predictive model: a use case
9.4.1. What variables help set a rental price on Airbnb?
9.5. Conclusion

Chapter 10. Classification Algorithms
10.1. Introduction
10.2. A tour of classification algorithms
10.2.1. Decision trees
10.2.2. Naïve Bayes
10.2.3. Support Vector Machine (SVM)
10.2.4. Other classification algorithms
10.3. Modeling Airbnb prices with classification algorithms
10.3.1. The work that's already been done: overview
10.3.2. Models based on trees: decision tree versus Random Forest
10.3.3. Price prediction with kNN
10.4. Conclusion

Chapter 11. Cluster Analysis
11.1. Introduction
11.2. Cluster analysis: general framework
11.2.1. Cluster analysis applications
11.2.2. The clustering algorithm and the similarity measure
11.3. Grouping similar objects using k-means
11.3.1. The k-means algorithm
11.3.2. Determining the number of clusters
11.4. Hierarchical classification
11.4.1. The hierarchical model approach
11.4.2. Dendrograms
11.5. Discovering hidden structures with clustering algorithms
11.5.1. Illustration of the classification of prices based on different characteristics using the k-means algorithm
11.5.2. Identifying the number of clusters k
11.6. Conclusion

Conclusion

References

Index
Preface
It seems that perfection is achieved not when there is nothing more to add, but when there is nothing left to take away.
Antoine de Saint-Exupéry
Welcome to this book! Reading the table of contents, you will have noticed the diversity of topics we are about to discuss. The purpose of Sharing Economy and Big Data Analytics, as the title suggests, is to provide a sound reference in these two fields, both of which emerged in the digital context. The fields are so vast that a book ten times larger could not cover everything.

From both a theoretical and a practical point of view, this book adds new knowledge and expands the growing body of literature on the Big Data-based sharing economy. It examines why sharing economy companies need the new data analysis techniques. These reasons relate not only to the implementation of advanced technologies enabling the capture and analysis of large amounts of available data, but also to the extraction of value, in order to better guide and optimize the decision-making process and to operationalize the company's various activities.

These questions will primarily be based on the following points:

– How did the different players give meaning to the concept of sharing? Or, put another way, how have they been able to develop business models that compete and/or coexist with existing models?
– How is value captured, created and delivered, and how is this value co-created through the interaction of multi-purpose players in this sharing context? Or, how does the ecosystem of sharing logic work, and what are its impacts and its economic, social and environmental challenges?

– What is the role of Big Data, and of the set of data analysis techniques and algorithms, in value creation in an economy based on sharing? How do these algorithms and new analytical techniques affect the ecosystem of the sharing economy?

The objective is to show the power of Big Data analytics and explain why it is so important for the sharing economy, especially since our review of the literature shows that few studies have addressed these two fields simultaneously. Herein lies the potential of this book, which will serve as a reference for readers seeking to understand the role of Big Data analytics as a critical success factor for sharing economy companies.

This book therefore discusses the why and the how, in order to help the reader understand these two parallel universes. For each chapter, we have chosen a different aspect that we find interesting, hoping that these aspects will serve as points of entry into these two exciting universes.

Soraya SEDKAOUI
Mounia KHELFAOUI
October 2019
Introduction
Knowledge is acquired through experience; everything else is information.
Albert Einstein
The discourse around sharing is eloquent: it evokes a feeling of kindness and solidarity in each of us. It expresses a desire to share part of what we have with others: we can share a meal, an apartment or a car. But can we share everything? "I ask nothing more than to share my sorrows and joy"1.

Sharing is an interface between individuals; it concerns both concrete and abstract things. In this way, we can share our pain, our joy, our love and even our life. Sharing helps to alleviate the suffering of people in need who are living in poverty. When we give part of what we have, we quiet the selfishness that lies dormant within us and awaken the hope of our human values in others.

The notion of sharing is not a human invention, but a reality that has accompanied humankind since the dawn of time. It is the result of the need to live in a community and strengthen social ties, because living in a society where sharing prevails strengthens security and trust between individuals.

Following the economic crises that shook the world economy after the Industrial Revolution, we became aware of the multi-dimensional nature of these crises
1 Jean-Louis-Auguste Commerson.
(economic, social, environmental) and the value of reviewing the business models adopted so far. "Sharing" has thus become the emblem of a new business model that we tend to call "the sharing economy". It is defined as an activity whereby goods or services are procured, provided or shared via digital technology or, more precisely, "digital platforms". The sharing economy is part of the sustainable development perspective, the purpose of which is to rationalize the exploitation of underutilized resources or assets.

This business model owes its reputation to the performance of the Internet and its digital platforms, even though individuals have engaged in sharing activities for thousands of years. Over time, digital platforms have become more efficient and now govern a wide range of business transactions: C2C, C2B and also B2B. But what about the spirit of sharing in the sharing economy?

Criticisms of the sharing economy focus on the lexical amalgam encountered in the activities of start-ups in this field. When a company like Airbnb is held up by most people as a prized example of the sharing economy, where is the principle of sharing in the paid rental or reservation of an apartment through this platform?

There is controversy over the concept of the sharing economy because other concepts are related to it: the collaborative economy, the platform economy, the on-demand economy, the independent economy, the gig economy, etc. To avoid this confusion, we can say that the sharing economy is about the use of underutilized resources, which are not necessarily free of charge: shared products or services can be either paid or free. In this context, the spirit of sharing or collaboration is a metaphor that satisfies the beliefs of a new standard of economic thought at odds with traditional models. Even if this displeases some people, we will loosely call it the "sharing economy" and the "collaborative economy".
I.1. Why this book?

Our most beautiful adventures are our thoughts.
Victor Cherbuliez, 1864
This is a pertinent question that we had to ask ourselves before embarking on this "delightful" adventure! On the one hand, the idea for this book was born from our personal convictions: as researchers, it is our responsibility to question and analyze the phenomena that challenge us. On the other hand, our backgrounds undoubtedly have a lot to do with the choice of this theme. This book is the meeting of two disciplines, environmental economics and data science, and we initially started from the interests of these two disciplines.

Environmental economics has undergone a remarkable expansion over the past forty years, coinciding with the rise of ecological awareness. The environmental crisis is the consequence of the degradation of the natural environment and the decline in society's well-being. Among the solutions contemplated to remedy this situation, the sharing economy stands as a stronghold against the waste of goods and services and against the overexploitation of natural resources. It is a new business model that reintegrates human values into the business world. This economy revives the spirit of sharing in economic transactions, even those that are monetized: monetary compensation remains insignificant compared to the satisfaction of having contributed to the sustainability of the community.

The genius of humankind is its ability to adapt to new situations by creating the means to respond to them. In the sharing economy context, it is the advent of digital technology, illustrated by research in Big Data analytics and giving rise to digital platforms, that has contributed to the realization of this new model. Big Data analytics has thus become an essential instrument for management in all disciplines, including economics. It helps to ensure efficiency in decision-making at both the macroeconomic and microeconomic levels.
Through this book, we have tried to demonstrate the role that Big Data analytics can play in the evolution of the sharing economy.
I.2. The scope of this book

Having knowledge without sharing it is to put oneself at the level of one who has no ideas.
Thucydides (5th century BC)
Through this book, we wanted to be forward-thinking by proposing a topical theme of great socio-economic relevance. It is dedicated to researchers in both disciplines, economics and data science. It is also intended for business leaders, managers and investors who want to engage in sharing economy activities, as well as for students, or simply those interested in using digital platforms. This book contains information that will satisfy the scientific intuition of researchers and refine the sharing economy and Big Data knowledge of "amateurs".

I.3. The challenge of this book

As I change my thoughts, the world around me changes.
Louise Lynn Hay
The challenge of sustainable development involves achieving the Millennium Development Goals, the first of which is the well-being of society. The initiatives undertaken to meet society's demands are very diverse and take different approaches, involving disciplines that seem dissimilar at first glance. This is the ambition of this book: to demonstrate the role of Big Data in the development of the sharing economy.

The sharing economy relies on digital platforms that connect buyers and sellers of goods and services. However, this sharing and collaboration would not be possible without the data, and the algorithms, that drive the sharing platforms. Data are therefore the driving force behind this economy. These large datasets, produced in various forms and in real time and forming what we now call Big Data, require sophisticated methods coupled with analytical tools in order to create value. Managing data from sharing economy platforms is therefore a factor that promotes cooperativism, which can maximize beneficial economic, societal and environmental impacts.
Big Data contributes in many ways to the growth of this sharing economy, and we will examine throughout this book the different ways in which this is reflected. By covering a wide range of topics such as Machine Learning algorithms, social innovation and sustainable development, this book provides concrete examples of successful uses of the sharing economy based on the analysis of large amounts of data. We chose Python, a widely used data science language, to analyze the data in the book's practical examples.

This book takes a practical implementation perspective, for amateurs and professionals alike. Its purpose is to familiarize you with the different concepts covered. It provides emerging insights into the theoretical and practical aspects of Big Data analytics in collaborative and sharing applications. It is the result of several years of research in environmental economics and Big Data analytics, and also draws on data analytics experience within the SRY Consulting start-up, based in Montpellier (France).

I.4. How to read this book

Science is organized knowledge.
Herbert Spencer
The purpose of this book rests on two concepts: the sharing economy and Big Data. Both are now well known and have generated great interest in research in recent years. Data is at the heart of the sharing economy, as it allows professionals in this context to operationalize their practices. Exploiting this data and transforming it into value is a high-performance business strategy, which may make it seem complex. This is not the case, however: you can gradually evolve your analytical capabilities and create opportunities at each step of your data analysis process.

Several studies have dealt with these two concepts, but separately. In this book, we have tried to combine them and to tease out the possible relationship between them. We have organized each chapter so that every reader can approach it with confidence. The next three sections highlight what is specifically covered in this book.
Part 1: The sharing economy or the emergence of a new business model

Part 1 offers a literature review on the theme of the sharing economy. It aims to sketch a coherent overview of the state of the art and the knowledge needed to understand this "phenomenon". In this part, we present the sharing economy as an alternative to an outdated economy that is unable to meet the socio-economic and environmental expectations of individuals. It sheds light on the challenges facing sharing economy practices and the possible risks faced by companies and society. Talking about the sharing economy cannot be done without mentioning the technology that led to the emergence of this new business model, namely digital platforms. The latter provide the foundations of the sharing economy, but also ensure its realization and its future.

Chapter 1, "The Sharing Economy: A Concept Under Construction", discusses the emergence of the "sharing economy" concept and its rise to becoming a business model that disrupts the practices of a bygone economy, one which has created socio-economic and environmental problems.

In Chapter 2, "An Opportunity for the Business World", we look at the consequences of adopting the sharing economy, namely the potential opportunities it offers the business world as an optional approach to gaining a competitive advantage.

Chapter 3, "Risks and Issues of the Sharing Economy", examines the inevitable risks and challenges that the sharing economy entails for economic agents. How will producers perceive the new consumer behavior? Will they be able to embrace it, or will they be on the defensive? How will individuals who see their jobs gradually disappear react?

Chapter 4, "Digital Platforms and the Sharing Mechanism", highlights the involvement of digital platforms in the process that is revolutionizing the context of sharing in the economy.
Although there is some reluctance to make this spirit of exchange a reality in commercial transactions through digital platforms, these platforms still play a major role in spreading the values associated with exchange between individuals.

Part 2: Big Data analytics at the service of the sharing economy

This part examines the different concepts related to Big Data and shows how this phenomenon transforms opportunities. It summarizes a considerable set of application examples in the sharing context. This part will be
of interest to all types of readers, as it addresses the general context of Big Data and highlights its importance for sharing economy companies, examining the different ways in which Big Data can benefit them.

Chapter 5, "Beyond the Word 'Big': The Changes", provides an overview of the basic concepts related to Big Data, while illustrating the changes that led to this phenomenon. This chapter also highlights the importance of data for businesses and how it can increase their efficiency.

Chapter 6, "The Art of Analytics", moves from the most basic form of data analysis to the most advanced, from descriptive analysis through predictive analysis to prescriptive analysis. This chapter is also an opportunity to discover the different steps of the data analysis process.

Chapter 7, "Data and Platforms in the Sharing Context", presents an overview of Big Data applications. These applications come from the business world and show the interest that large companies take in this phenomenon, allowing the reader to understand the power of data and the importance of digital platforms in realizing the strategies of sharing economy companies. This chapter thus addresses the different challenges related to data and highlights the importance of data openness.

Chapter 8, "Big Data Analytics Applied to the Sharing Economy", is detailed enough, and potentially useful, to build a global approach to the subject. This chapter demonstrates that the successful implementation of a data-driven culture is an important factor in carrying out sharing economy practices. It reviews different Machine Learning algorithms and advanced technological tools for data analysis.

Part 3: The sharing economy? Not without the Big Data algorithms

This last part presents a range of advanced data analysis algorithms, covering regression, classification and cluster analysis. It provides a set of techniques to anyone who wants to generate value from data using the data analysis process.
Using practical examples, we introduce the fundamental principles of the Big Data analytics process.

Chapter 9, "Linear Regression", discusses the essential techniques for modeling linear relationships between explanatory variables and an outcome variable in order to make predictions on a continuous scale. After introducing the linear regression model, it also discusses logistic regression and Ridge and Lasso regressions. Based on an example Airbnb database, this chapter provides a
practical guide to exploring the data and building a predictive model more efficiently, using Python.

Chapter 10, "Classification Algorithms", revisits the origins of supervised learning and introduces the most common classification algorithms. This chapter is an introduction to the fundamental principles of different techniques that help to better classify data. Using the same Airbnb database, it illustrates the most significant features in the model definition.

Chapter 11, "Cluster Analysis", moves the focus to another sub-domain of Machine Learning: unsupervised learning. This chapter examines clustering algorithms, mainly k-means and hierarchical classification. To better illustrate the principles discussed throughout the chapter, a practical example shows how to find groups of objects that share a certain degree of similarity.

In conclusion, our ambition is to make this book one of the first basic references on sharing economy practices boosted by Big Data. We hope that it will open up new horizons for you by presenting approaches that you may not have encountered before, and that it will sharpen your curiosity and stimulate your desire to learn more.
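As a foretaste of the practical chapters, here is a minimal sketch in the spirit of Chapter 9's use case: fitting a linear regression in Python to predict a rental price from listing features. The data below is synthetic, and the feature names (accommodates, bedrooms) are illustrative assumptions, not the Airbnb dataset analyzed in the book.

```python
# Minimal linear regression sketch (synthetic data, not the book's dataset).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
n = 500

# Two hypothetical explanatory variables: guest capacity and bedroom count.
accommodates = rng.integers(1, 9, size=n)
bedrooms = rng.integers(1, 5, size=n)

# Assumed price structure: base rate + per-guest + per-bedroom + noise.
price = 40 + 18 * accommodates + 25 * bedrooms + rng.normal(0, 10, size=n)

X = np.column_stack([accommodates, bedrooms])
X_train, X_test, y_train, y_test = train_test_split(
    X, price, test_size=0.2, random_state=0
)

# Fit on the training split, evaluate on held-out listings.
model = LinearRegression().fit(X_train, y_train)
mae = mean_absolute_error(y_test, model.predict(X_test))
print(f"coefficients: {model.coef_.round(1)}, MAE: {mae:.1f}")
```

Because the model family matches the (assumed) data-generating process, the fitted coefficients land close to the per-guest and per-bedroom rates used to simulate the prices; the book's chapters walk through the same workflow on real listing data.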
PART 1
The Sharing Economy or the Emergence of a New Business Model
1

The Sharing Economy: A Concept Under Construction
What cannot be eschew'd must be embraced.
William Shakespeare, The Merry Wives of Windsor, Act V, Scene 5
1.1. Introduction

For more than 20 years, society’s behavior has been changing in terms of daily consumption. In the interest of economic rationality, people who do not know each other and are not part of the same family agree to live together, travel in the same vehicle, work in the same space and participate in common projects. They decide to join forces in activities such as production, distribution and consumption. In other words, they have a new approach to economics, one that involves collaboration and a spirit of sharing: it is commonly referred to as “the collaborative economy” or “the sharing economy”. One question arises: why did this concept appear? Several factors have contributed to the advent of the sharing economy: the development of IT tools and mobile technology (smartphones, tablets), the globalization of the economy, global economic crises and environmental vigilance fueled by ecological awareness of the negative externalities of economic activity.
Indeed, we are witnessing a deterioration in the efficiency of the resources used. Demailly and Novel (2014) agree with this idea, arguing that: “For those in the sharing economy, it is nothing more or less than an underutilization of material goods, capital, and therefore an economic and environmental waste.” We cannot consider the theme of sharing as recent, since the exchange of goods and services has existed, in many different forms, for a very long time: mutual agricultural cooperatives and mutual insurance companies existed long before the 21st Century. In other words, the idea of enjoying a good or service without owning it was already part of people’s consumption strategies. Moreover, the expansion of the sharing economy responds to the contemporary demands of people concerned about the quality of the environment and the degradation inflicted on natural resources by intensive economic activity since the Industrial Revolution. The sharing economy is a social phenomenon, in that it has allowed ordinary people to use their surplus resources in many forms and to consume in a mutual and altruistic way. It is also environmental, because the sharing economy makes it possible to absorb excess capacity (infrastructure, cars, etc.), reduce the exploitation of natural resources, and meet the needs of those who are most disadvantaged in society. So, what is the “instrument” that has enabled the expansion of this economic phenomenon? The notion of the sharing economy is still in its conceptualization phase and does not yet have a consensual definition. It only dates back to the early 1990s. Indeed: Collaborative economics is a concept carried by successful essayists such as J. Rifkin, R. Botsman or F. Turner, activists like M. Bauwens, media like Shareable in the United States, think tanks and do tanks like OuiShare in France, or by some entrepreneurs and start-up managers, and even more and more large companies. (Massé et al. 
2015) The conceptualization of the sharing economy is in its infancy; its advantages and disadvantages are not yet well defined. Although it is well received, it does not have an epistemological definition. In an ironic note, Botsman says: “The sharing economy lacks a shared definition.”
The sharing economy has become a global phenomenon. Thus, can it represent an economic model that competes with the traditional economy? What are the advantages of this new economy? Will it contribute to the sustainability of the economy?

1.2. From simple sharing to the sharing economy

The act of sharing is natural for us; it responds to a tangible satisfaction (the exchange of goods and services), but also to self-satisfaction, to the extent that we put our humanity into practice in the relationships that unite us. Human beings learn to share from an early age. As time goes by, the sense of ownership gives way to a sense of sharing. Moreover, religious beliefs have instilled in us the desire to escape the grip of self-centeredness. Monotheistic religions call for sharing through their main foundations and practices. Thus, the concept of sharing is interpreted in society in several ways: cooperatives, mutual associations, volunteering, etc. The spirit of sharing was embodied in the development of digital technology, which has grown tremendously in recent years: In cities, new digital technologies are revolutionizing the way we use transportation, housing, goods and other services [...] The sharing economy has disrupted virtually every sector, creating a multitude of markets based on platforms that connect individuals, businesses and communities. (Hodkinson 2017) To understand the shift from the concept of “sharing” to that of the sharing economy, let us immerse ourselves in the history of its origin, its foundations and its practices.

1.2.1. The genesis of the sharing economy and the break with “consumer” society

Etymologically speaking, the word “to share” comes from the Latin partes agere: partes means “to make equal parts” and agere means “to push” or “to activate”. The concept has thus existed since antiquity and combines the two meanings: dividing an entity for a specific purpose.
This can be a sharing of inheritance to benefit all heirs, or a sharing of power to delegate certain responsibilities.
In the notion of the “sharing economy”, each of the two words has kept its specificity. Sharing always means leaving a part of something that belongs to us to one or more people, no matter how this action is carried out. It is the combination of the two words that makes the difference. Indeed, the sharing economy has become a hallmark of the solidarity economy and has experienced staggering growth in recent years thanks to the development of information technology and mobile technology. Via software, or more precisely platforms, business transactions and meetings between suppliers and service providers are growing exponentially. The term “sharing economy” was first added to the Oxford Dictionary in 2015. Similarly, the scientific literature on the concept is relatively new. Researchers have looked at different aspects of the sharing economy under different names, among them:
– peer-to-peer economy;
– collaborative economy;
– collaborative production and collaborative consumption;
– access economy;
– access-based consumption;
– local economy;
– commons-based peer production and the mesh;
– product-service system and on-demand economy;
– wholesale economics;
– platform economy.
Nevertheless, the term “sharing economy” is the most commonly used in the literature (Ranjbari et al. 2018). “We believe that the sharing economy can be the defining story of the 21st Century if we come together to build it.” These are the words of Natalie Foster, cofounder and executive director of Peers.org. The defenders of this ideology want to fight overconsumption and waste, create social equity and make the sharing economy the “economic model of the third millennium”.
The sharing economy seeks to break with the consumption practices that have prevailed since the Industrial Revolution, in which consumers are driven to consume excessively. Consumer society took off after the Second World War and reached its peak during the post-war boom years known in France as the “Trente Glorieuses” (the Glorious Thirty). It was born with the advent of Taylor’s theory and the launch of assembly-line work. This process resulted in an abundance of products and reduced prices, which became accessible to all segments of society. To this must be added the bewitching power of advertising, which manipulates the consumer at the whim of producers: Advertising is the main instrument of manipulation; in fact, it conditions, brings a subliminal perception and influences the population subconsciously. Over the years, it has mobilized mental manipulation techniques from the human sciences to encourage consumers to buy.1 Without question, consumer society has allowed people to satisfy their every desire for goods and services and to live in undeniable comfort: a vehicle for getting around, water and energy brought to the home, sophisticated objects for entertainment and many other gadgets. Nevertheless, it has had negative consequences, which have been denounced since the 1960s, in particular by George Katona, inventor of the term “consumer society” and initiator of consumer psychology. At that time, “markets were not yet globalized; private and public spheres of life were not as commercialized as today; and the information and digital communication society was not yet born” (Reisch 2008). The alternative to consumer society is the collaborative society, which aims to deal with the hyper-consumption that has governed consumer behavior for many years.
The pivot of this new trend is “collaborative consumption”, defined as “the set of resource circulation systems that allow consumers to use and provide valuable resources temporarily or permanently, through direct interaction with another consumer or through a mediator” (Decrop 2017). The concept of collaborative consumption was initiated by Felson and Spaeth in 1978, long before the boom in mobile technology and social networks, with a different objective than the one that is “now attributed to the concept” (Ertz 2017). 1 Available at: http://tpe-societe-de-consommation.e-monsite.com/pages/tpe/iii-limite-etcritiques-de-la-societe-de-consommation-1.html.
Collaborative consumption has changed the subjective conception of the value of goods and services, well known in the currents of economic thought. Indeed, collaborative consumption is based on a new principle of “access/use rather than purchase/ownership” of goods, capital or services, through the pooling of personal resources (Decrop 2017). Collaborative consumption is considered an economic model in that it has brought about change in society by drawing on old values, such as giving and altruism. In affirming that collaborative consumption is the ultimate alternative to excessive consumption, Botsman and Rogers (2010) use examples of companies “within this model” in order to highlight the specificities of the different practices and trends of this new consumption model (Bicrel 2012). This view is shared by a large number of authors who argue that collaborative behavior is increasingly becoming a socio-economic answer to the economic crises of the capitalist model and a possible response to environmental problems. The “abandonment” of excessive consumption behavior is imminent, and even if it does not happen today, it is starting to seep into individuals’ mores. This is now the objective of the collaborative economy.

1.2.2. The sharing economy: which economy?

Sharing, in the formal sense of the word, means “to divide a whole into several pieces in order to own it with others”. But in this case, sharing means “to take part in a whole or to benefit from a part of a service or good”. The representation of the economy according to the sharing orthodoxy revisits economic reality. Does it simplify it? Or does it complicate it even more? We are not ready to decide, but we can argue that this is a new “identification” or “modeling” of the economy. The characteristic of this model lies in the novelty of the process of sharing goods and services.
Indeed, exchanges follow a horizontal and decentralized trajectory between individuals, by modifying the redistribution system (Penn 2016). The sharing economy embodies a new socio-economic system based on the exchange of goods (vehicles, housing, tools, etc.) and services (carpooling, takeback, etc.) between individuals. It may involve profit-based transactions, in other
words, there is a monetary exchange; or it may involve giving, bartering or volunteer activities. But what has made the sharing economy famous? The thing that offers all these opportunities for exchange and makes this new business model famous is a technical instrument: the “digital platform”. The sharing economy is thus also called the “platform economy”. It is defined as being constituted by trades between “peers with platforms, acting as brokers between them” (Nicot 2017). Platforms ensure the success of the sharing economy because they combine several features: First, the platforms provide an algorithm for efficient matching between labor providers and users. Second, technology reduces transaction costs because platforms can also facilitate microtransactions. Third, platforms provide services to reduce or manage the risks associated with market transactions, thereby addressing market failures, in other words, incomplete or erroneous information. (Drahokoupil and Fabo 2016) The sharing economy is not just a fashionable phenomenon. Today, an increasing number of companies depend on the intensive use of digital platforms, which allow supply and demand to be matched easily and provide the trust required by users. Even materialistic consumers, who are more inclined to own things, are attracted to the sharing economy, and projections show that the main sharing sectors, namely carpooling, online staffing, music and video streaming, finance and housing, have the potential to increase global revenues from about $15 billion to about $335 billion by 2025. Such exponential growth is a reminder of the importance of the subject in both theory and practice (Ranjbari et al. 2018). The sharing economy also creates jobs, can support the ecological cause (shared resources, reduction of negative effects) and can remedy excessive consumption (Torfs 2016a). It contributes to the achievement of the objectives of sustainable development, namely the sustainability of the economy.
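The first feature in the quotation above, algorithmic matching between providers and users, can be illustrated with a toy sketch. The greedy matcher below pairs each request with the free provider whose price best fits its budget; the names, prices and one-provider-per-request rule are hypothetical simplifications, far cruder than the signals (location, ratings, availability) real platforms use.

```python
# Toy sketch of platform-style matching between providers and requests.
# All names, prices and rules here are hypothetical illustrations.

def match(requests, providers):
    """Greedily assign each request to the closest-priced free provider."""
    assignments = {}
    free = dict(providers)  # provider name -> offered price
    for name, budget in requests.items():
        if not free:
            break
        # Pick the provider whose price is closest to the request's budget.
        best = min(free, key=lambda p: abs(free[p] - budget))
        if free[best] <= budget:   # only match affordable offers
            assignments[name] = best
            del free[best]         # each provider serves one request
    return assignments

requests = {"alice": 50, "bob": 30}    # riders and their budgets
providers = {"carol": 45, "dan": 25}   # drivers and their prices
print(match(requests, providers))      # {'alice': 'carol', 'bob': 'dan'}
```

The point of the sketch is only that matching is a computation over data supplied by both sides of the market, which is why platforms sit naturally at the center of these exchanges.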
Like any economic model, the sharing economy draws on basic elements to build its theory, through the formalization of the economic phenomenon and hypotheses, so as to identify its functioning and the scope of its model.
What are the foundations of the sharing economy? What does the technical nature of digital platforms give it? Certainly, technology has given the sharing economy a particular character in terms of conceptualization, but it remains a concept based on currents of theoretical reflection.

1.3. The foundations of the sharing economy

The sharing economy is a multidisciplinary concept supported by socioeconomic partners, entrepreneurs and ecological activists. A consensus on the theoretical foundations of this concept does not yet exist. However, defenders of this economy have mobilized a review of theoretical thinking in order to identify collaborative practices: the free market economy and the P2P economy, the gift economy and the service economy (Borel et al. 2015). These three elements will be addressed in the following sections.

1.3.1. Peer-to-peer (P2P): a revolution in computer networks

The advent of the Internet has changed the way humans behave in relation to their ways of consuming. The prowess of IT has not only shaken up “the way in which purchases are made between online sales professionals and a consumer. Web 2.0 and its myriad of social networks are changing our relationships with others and also our way of consuming” (Evroux et al. 2014).

Unlike web 1.0, web 2.0 is a platform that provides more opportunities for disseminating and sharing information. The consumer is no longer just a passive consumer, but becomes active and participates in the “making” of information.

Box 1.1. The social web
The tremendous expansion of the social web is embodied in peer-to-peer (P2P) technology, with the initial sharing of media, music and video files. The proper meaning of the word peer-to-peer is “node-to-node”. In this process, the interconnected nodes share resources with each other without using a centralized administrative system. In other words, each node of the computer network is both client and server, unlike the old client-server model.
Figure 1.1. P2P diagram (Evroux et al. 2014). For a color version of this figure, see www.iste.co.uk/sedkaoui/economy.zip
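The client-and-server duality shown in the diagram can be made concrete with a deliberately simplified, in-process Python sketch. It is not a real network protocol, and the class and file names are invented for illustration: each Peer both serves the files it holds and fetches files directly from its neighbors, with no central server.

```python
# Minimal in-process sketch of the P2P idea: every node is both a
# client (it requests resources) and a server (it shares resources).
# A real P2P system runs over sockets and needs discovery, presence
# and identity mechanisms; this illustration omits all of that.

class Peer:
    def __init__(self, name, shared_files):
        self.name = name
        self.shared = dict(shared_files)  # file name -> content
        self.neighbors = []               # directly known peers

    def connect(self, other):
        # Symmetric link: there is no privileged "server" node.
        self.neighbors.append(other)
        other.neighbors.append(self)

    def serve(self, filename):
        """Server role: answer a request for a file this peer holds."""
        return self.shared.get(filename)

    def fetch(self, filename):
        """Client role: look locally, then ask each neighbor directly."""
        if filename in self.shared:
            return self.shared[filename]
        for peer in self.neighbors:
            content = peer.serve(filename)
            if content is not None:
                self.shared[filename] = content  # replicate once fetched
                return content
        return None

a = Peer("a", {"song.mp3": "audio-bytes"})
b = Peer("b", {})
a.connect(b)
print(b.fetch("song.mp3"))  # b obtains the file directly from a
```

Note that once b has fetched the file, it can in turn serve it to other peers: resources spread across the edge of the network instead of being concentrated in one server.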
The success of the P2P system as a social phenomenon has important economic (rights and taxes) and moral (knowledge exchange) consequences. It allows efficiency in the use of networks and economic performance (reduction of infrastructure costs and exploitation of the high idle potential at the edge of the Internet (Benoît-Moreau 2006)). The P2P network was mainly known under the Napster2 brand. In this application, the P2P network concept was used to share media files, in other words, to exchange compressed MPEG Layer 3 (mp3) audio files. However, P2P is not only about file sharing; it is also about establishing multimedia, video, document and cryptographic communication networks based on resource sharing (Schollmeier 2001). Since its invention, P2P technology has been defined in several ways by computer theorists and professionals. Michel Bauwens3 describes P2P as “a form of network-based organization, based on the free participation of equipotent partners engaged in the production and use of common resources”. Peer-to-peer does not use financial compensation as the main motivation, and it does not use traditional command and control methods: 2 Napster pioneered P2P technology, specializing in the sharing of multimedia files; the brand later became an online music store. 3 Michel Bauwens is a Belgian computer scientist, P2P theorist, author and lecturer on innovative technological and cultural topics (source: Wikipedia).
It creates something common rather than a market or a state, and relies on social relationships to allocate resources, rather than a price mechanism or hierarchical system. (Evroux et al. 2014) Bauwens’ definition considers the P2P system as a non-profit social organization whose only ambition is to share data between individuals, whether they are producers or users. It follows a protocol created by society and for society, without any system of supervision, and it tries to moralize the use of technology. The Intel Working Group defines P2P as “the sharing of IT resources and services through direct exchange between systems”. Alex Weytsel of the Aberdeen Group considers P2P to be “the use of devices on the outskirts of the Internet in a non-client capacity”. Ross Lee Graham identifies P2P according to three main requirements: possession of an operational computer of server quality, an addressing system independent of DNS, and the ability to cope with variable connectivity (Milojicic et al. 2002). Clay Shirky of O’Reilly and Associates offers the following definition: P2P is a class of applications that takes advantage of the resources – storage, cycles, content, human presence – available on Internet devices. (Evroux et al. 2014) In this vein, Shirky summarized the main operational elements of P2P into three concepts (Koulouris 2010) in 2001: – Presence: this includes the ability to say when a resource is online. Determining the presence of a resource is necessary for P2P networks, as the permanent availability of resources is not guaranteed. Once a user or resource’s online presence is established, any number of highly personalized services can be offered. Presence is essential for the creation of user-centric systems, such as instant messaging. – Identity: P2P networks must be able to uniquely identify available resources.
The identification systems in use were designed for machines permanently connected to the Internet; users without a stable IP address (see Box 1.2) could not be recognized or identified. Thus, P2P technology solves this problem by using its own naming system. – Resources: P2P networks make it possible to use the resources available at the “edges” of the Internet: processing power, storage, content, human presence. This is in contrast to current client/server services, where usable resources are concentrated in servers: the “core” areas of the Internet. P2P services organize a pool of resources of variable size belonging to the participants and allow them to use it collectively.
“An IP address is simply a way to identify a device connected to a computer network. Just as a postal address is used to receive mail at home, an IP address is used to send and receive data between devices connected on the same network. Without an IP address, millions of computer devices would not be able to communicate with each other.” Box 1.2. IP4 address
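Box 1.2’s definition can be made concrete with Python’s standard ipaddress module, which parses and validates such addresses; the addresses below come from documentation-reserved example ranges, not real devices.

```python
# Illustrating what an IP address is, using Python's standard library.
# 192.0.2.7 belongs to a range reserved for documentation examples.
import ipaddress

addr = ipaddress.ip_address("192.0.2.7")    # identify one device
net = ipaddress.ip_network("192.0.2.0/24")  # the network it belongs to

print(addr.version)  # 4: this is an IPv4 address
print(addr in net)   # True: the device is part of this network
print(ipaddress.ip_address("2001:db8::1").version)  # 6: an IPv6 address
```

The membership test is exactly the notion the box describes: an address locates a device within a network so that data can be routed to and from it.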
Lastly, Kindberg believes that P2P system resources have independent lifespans (Deepak and Sanjay 2005). All these definitions outline a profile of P2P technology, each with a specific approach. However, they all agree on two main characteristics. The first is that P2P technology can scale, as there are no algorithmic or technical restrictions on system size. The second is that it is reliable, because the malfunction of a given node does not affect the entire system (Ding et al. 2005). It offers a wide range of opportunities and paves the way for initiatives and innovations in the field of the sharing economy. P2P is a technical support and an asset for materializing exchanges between people. It is certainly essential, but it is not enough to make the sharing economy a realistic alternative to hyper-consumption as an economic model.

1.3.2. The gift: the abstract aspect of the sharing economy

There are subjective aspects, inspired by ethics and human values, that form a substantial base for sharing between people. The “gift” is one of these aspects. It is the act by which a person voluntarily disposes of a good or service for the benefit of another person (or organization), without financial compensation. If there are no monetary benefits, can we even speak of an economic act? The question seems absurd. Most of the time, gifting is not approached from an economic perspective, but rather from a socio-philosophical point of view. Thus, “a gift is a privileged object of anthropology and economic sociology since the essay on gifting by M. Mauss”5 (Athané 2008). 4 Source: https://www.justaskgemalto.com/fr/une-adresse-ip-cest-quoi/. 5 Marcel Mauss, who was born in 1872 and died in 1950 in Paris, is considered to be the father of French anthropology. His work focused, among other things, on the meaning of the gift in tribal societies.
For Mauss, giving is essential in human society and has three phases: the obligation to give, the obligation to receive and the obligation to give back. His famous work is entitled The Gift: the Form and Reason for Exchange in Archaic Societies.
“[...] You will then have a fairly good idea of the kind of economy that is at present laboriously in gestation. We see it already functioning in certain economic groupings, and in the hearts of the masses, who possess very often better than their leaders, a sense of their own interests, and of the common interest. Perhaps by studying these obscure aspects of social life, we shall succeed in throwing a little light upon the path that our nations must follow, both in their morality and in their economy.” Box 1.3. The gift according to Mauss
These disciplines try to explain the embedding of economics in the norms that control social relations. Embedding, in Polanyi’s6 sense, refers to the inclusion of the economy, as a means of satisfying human needs, in the political and cultural orders that govern certain forms of movement of goods and services (Carvalho and Dzimira 2000). Much research has focused on the role of the gift in business and on the market value of the act of giving. In the economic context, the gift is subtly integrated into the behavior of economic agents. “For many economists and managers, it is a social practice, of an emotional or moral nature, that is beyond their area of competence” (Gomez et al. 2015). The logic of the gift is an exception in commercial society. It is the opposite of merchant exchange; it is one-way. The donor does not expect any consideration when he decides to bequeath ownership of his property (Lasida 2009). For some, to evoke the gift in economics is an absurdity; they believe that several obstacles prevent the gift from shaping a new economic model. For them, an economic good (or service) has an immediate use value; in other words, it must have a monetary value to be exchanged. However, if the issue of the gift in the economy or market economy is addressed in a social context, it is likely to foster the social relationships that have developed in the market (Lasida 2009). In their book on the gift in business, Gomez et al. (2015) argue that the gift and free giving are found throughout the company and in markets:
6 “Karl Polanyi (1886–1964), a writer and teacher of Hungarian origin, lived in Central Europe and Great Britain before emigrating to the United States during the Second World War. He is the author of a brilliant and powerful critique of the liberal tendency to place the market at the center of human nature and society, a phenomenon that we could call ‘commercial fundamentalism’” (Hart 2008).
The helping hand given to a colleague, returning a favor, the free transfer of information, corporate gifts, free discounts to the customer, the service provided without being obliged to do so, etc. This approach will help us to explain the “presence of the gift” in economic activity. Indeed, within the new economic sociology, the market, in addition to being the (geographical or virtual) place where supplier and demander meet, “characterizes a specific form of social relationship: one in which prices determine relationships to things and individuals, even if these prices result from a struggle between agents before the results of this struggle are imposed on them” (Steiner 2012). Economic sociology considers the market as a social structure. Steiner (2005) draws on Swedberg’s reasoning (Swedberg and Smelser 1994) to support this idea: My main objective, however, will be to examine markets from a particular perspective, as a specific type of social structure. Although social structure can be defined in different ways, this term is generally understood as a kind of recurrent and structured interaction between agents, maintained by means of sanctions. Thus, he argues the following: Contemporary economic sociology is interested in the origin of this social structure, in other words, the rules that allow it to function; it studies its different forms and researches the reasons for their evolution. This is now called the social construction of markets. Moreover, by highlighting the shareholder aspect, contemporary economic sociology is led to consider the question of self-interested behavior in this constructivist framework. The considerations of economic sociology confirm that the economic aspect is “always socially situated”. They reaffirm that economic exchange depends on person-to-person relationships and “extra-economic factors” (Pascal 2002).
In other words, economic sociology integrates social considerations into economic activity by using concepts illustrating free exchange, namely “the gift and the social and solidarity economy”. It should be noted, however, that there is no consensus on the paradigm of the gift and the solidarity economy current. According to Steiner in 1999, they are “either excluded from the field of economic sociology”, or, according to Levesque et al. in 2001, they are “included in it” (cited in Lasida 2009).
With reference to Godbout’s book (1992), the gift finds its place in the economic sphere. The author integrates it into the business and corporate world with finesse. He asserts that the gift serves “the circulation of things, the sale and disposal of products”. Godbout’s reflection here is based on Carnegie’s work, particularly his book7 on a personal development method adapted to the business world, published in 1936. Carnegie gave the giving formula to the business community and the market. In his book, Carnegie shares the lessons learned from his extremely successful years of leadership (Godbout 1992). These lessons do not focus on seizing power or careerism. On the contrary, he discovered that to become more powerful and strengthen business, people simply had to be treated with kindness. The author suggests that the businessman should give before he hopes to receive. He suggests that the businessman be sincere by offering gifts before the merchandise, because the disinterest he shows is felt through the way he gestures, looks or simply does things. Carnegie’s book describes the attitude that the businessman must adopt towards people. His behavior must emanate from the human values governing social relations (loyalty, faithfulness, impartiality, transparency, enthusiasm, team spirit), and the utilitarianism that prevails in the business world must kindly integrate the logic of giving. The interest that gift theory evokes in economic sociology is not the same as in economics and management. In these disciplines: theories such as liberal theory, the neoclassical business model and the utilitarian paradigm exert an overwhelming dominance, both in scientific publications and in educational programs at universities and business schools. (Masclef 2013) Thus, gifting remains a marginal phenomenon that does not reflect the reality of utilitarianism and exchange.
In this context, Godbout (1992) indicated that two economists, François Perroux (in 1960) and Serge-Christophe Kolm (in 1984), identified three complementary economic systems: the market system, governed by interest, the planning system,
7 The book is entitled How to Win Friends and Influence People. It sets out three lessons for improving one’s personality: think beyond yourself, be engaged and interested, and empower and encourage others.
governed by constraint, and the gift system, without considering the gift as an economic system. The author develops his analysis by stating that: The giving system is not primarily an economic system, but the social system of person-to-person relationships. It is not a complement to the market or the plan, but a complement to the economy and the State. And it is even more fundamental, more important than they are, as the example of disorganized countries shows. In the East, or in the Third World, where the market and the State are unable or no longer able to organize themselves, there ultimately still remains a network of interpersonal relations cemented by gift and mutual aid which, alone, makes it possible to survive in a world of madness. Moreover, some analyses show that social exchanges do not oppose economic logic; on the contrary, they give it meaning (Alter 2010). Material exchanges between individuals intrinsically involve social ties and can lead to non-reciprocal economic relations, that is, giving without expecting a material counterpart. In other words, in its quest for efficiency, economic activity, particularly business activity, is imbued with philanthropic and altruistic social values and is not focused solely on making profits. Some global economies have understood the role of social values in the consolidation and sustainability of their businesses. Indeed, several researchers have studied the Japanese model in this context. Japanese companies draw on their social culture, which is based on nine concepts: National sentiment (pride), honor, reliance on virtue, respect for tradition (the weight of history), respect for authority, social conformity, consensus (harmony), a sense of work (a taste for effort) and pragmatism. (Verne and Meier 2013) These virtues give Japanese companies a specific character. Managers and employees believe in these values and act in a collective spirit of cooperation, commitment and firmness.
Companies take pride in their role in society, contributing to the "well-being and prosperity of employees, customers and suppliers, in addition to the community as a whole" (Rolland-Piégue 2011). In doing so, Japanese companies strengthen solidarity in their communities and awaken the spirit of sharing and giving in each of their stakeholders.
Sharing Economy and Big Data Analytics
Thus, the gift, as an economy that supports exchange between people, lays the foundations for an economy based on social norms.

1.3.3. The service economy and the offer of use

Among the theories mobilized to explain the purpose of the sharing economy is the service economy. It offers an alternative to the excessive consumption of products and favors the use of a good over its ownership. The concept of the "service economy" appeared in 1986 at the initiative of Walter Stahel and Orio Giarini (Bourg and Buclet 2005).

"The service economy, sometimes also called the 'performance economy', is a concept that was developed by Walter Stahel in the 1980s and taken up by Dominique Bourg in the 2000s."

Box 1.4. The service economy (ADEME 2017)
It has been approached by two theoretical schools. The first is the Anglo-Saxon school, whose thinking is based on the Product Service System (PSS) model; its main contributors are Walter Stahel, Nicolas Buclet and Oksana Mont. The second is the École française de l'économie de la fonctionnalité et de la coopération (French school of the service economy and cooperation, or EFC), whose ideas are mainly supported by Christophe Sempels and Christian du Tertre (Chaput 2015). Stahel (2006) testifies that:

The service economy, which aims to optimize the use – or function – of goods and services, focuses on the management of existing wealth, in the form of products, knowledge or natural capital. The economic objective is to create the highest possible usage value for as long as possible, while consuming the least amount of material resources and energy possible. The aim is to achieve greater competitiveness and an increase in corporate income. (Van Niel 2007)

It is also defined by ADEME (Agence de l'environnement et de la maîtrise de l'énergie – the French agency for the environment and energy management) as the economy which "consists of providing companies, individuals or territories with integrated solutions for services and goods based on the sale of a performance of usage, or a usage, and not on the sale of goods alone. These solutions must allow a lower
consumption of natural resources in a circular economy perspective, an increase in people's well-being and economic development".

The basic idea of this concept is a new view of the economic value of goods. While the traditional approach values a good according to its cost, supply and demand, the contemporary approach, inspired by the principles of sustainable development, observes that this economic value takes practically no account of the damage inflicted on the environment by the production and overconsumption of the good. Therefore, in the absence of the integration of negative externalities (of the production and use of the good), the value of the good no longer lies in its appropriation, but rather in its use.

The service economy is a new business model, as it reorganizes the relationship between the market, the company and the consumer. Indeed, since the advent of industrialization, the focus of economic activity has been on large-scale production and consumption. The objective was to produce more in order to satisfy consumption, while constantly renewing goods and services:

For a company, the growth of its markets goes hand in hand with the increase in the number of units produced, related to the acceleration of the obsolescence of its products. This obsolescence is frequently "programmed", either by limiting the material lifetime of products, designed to no longer be repairable, or by stimulating consumers to equip themselves with the latest innovations, few of which constitute radical innovations responding to new needs and/or uses. (Buclet 2005)

However, the era of industrialization no longer embodies the dream of the Trente Glorieuses, the 30-year boom that followed the end of World War Two.
Mass production for mass consumption, the hallmark of Fordist doctrine, is no longer perceived as an advantage for the well-being of people:

For many economists (Gadrey in 2002, Gadrey and Zarifian in 2002, Du Tertre in 2008 and 2009), overcoming the Fordist crisis implies the implementation of a new form of economic organization based on a service-oriented dynamic, which they present as a central phenomenon in current economic and societal developments. The idea would no longer be to buy the good you need, but to rent the service provided by its consumption. (Vaileanu-Paun and Boutillier 2012)

The principle of this new model is to change the relationship between the company and the consumer. This does not mean that the objectives of the company or the consumer have changed. Consumers still seek to satisfy their well-being while taking the new situation (economic crisis, ecological crisis, etc.) into account. The company, for its part, continues to pursue its profitability by remaining the owner of the assets and selling their use (Buclet 2005). Practices following this logic have been adopted by several well-known companies around the world: Michelin, Xerox, Amazon, Apple, Microsoft and many others.

"The service economy is very well illustrated by two companies: Xerox and Michelin. Xerox is a company that manufactures and distributes office automation equipment (photocopiers in particular). Faced with resistance from its customers to investing in expensive new photocopiers, Xerox decided to no longer sell the machines, but to invoice each photocopy made. The same principle applies to Michelin, which, observing resistance from its customers to the prices of new, more fuel-efficient tires, decided to no longer charge for tires, but for the kilometers covered by its customers. Both companies no longer sell the products (photocopiers or tires), but rather the use of these goods (photocopies or kilometers)."

Box 1.5. Xerox and Michelin's service economy
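As a purely illustrative sketch (the prices and volumes below are invented for the example, not actual Xerox or Michelin figures), the shift from selling a good to invoicing its use can be expressed as:

```python
# Hypothetical pay-per-use billing, illustrating the Xerox/Michelin logic:
# the provider keeps ownership of the asset and invoices each unit of use
# (photocopies, kilometers). All figures here are invented.

def ownership_cost(purchase_price: float) -> float:
    """Classical model: the customer buys the machine outright."""
    return purchase_price

def usage_cost(price_per_unit: float, units_used: int) -> float:
    """Service-economy model: the customer pays only per unit of use."""
    return price_per_unit * units_used

if __name__ == "__main__":
    copier_price = 12_000.0   # hypothetical purchase price of a photocopier
    price_per_copy = 0.04     # hypothetical invoiced price per photocopy

    # A low-volume customer avoids a large upfront investment, while the
    # provider's revenue now scales with actual use, not with machines sold.
    print(ownership_cost(copier_price))        # 12000.0
    print(usage_cost(price_per_copy, 50_000))  # 2000.0
```

The point of the sketch is the change of revenue variable: the provider's income depends on `units_used`, which gives it an incentive to keep the asset durable and in service rather than to sell replacements.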
It should be noted, however, that the development of the service economy depends on the capabilities of the Internet, and we have previously demonstrated the role of digital platforms in economic activity. Indeed:

The Internet is not only the technical support for the transformations that give rise to the digital economy, it is also the driving force behind this new economy. (Brousseau and Curien 2001)

The service economy is thus considered a new economic model that can change the relationships between economic players. The question that arises is therefore: to what extent can this new economy contribute to the well-being of populations on an economic, social and environmental level?

As an economic model based on substituting the use of a good for its ownership, the service economy promotes the concept of "decoupling" between the added value created by economic activity and the consumption of energy and raw materials. "Decoupling is the breaking of the links between 'environmental ailments' and economic goods" (OECD 2004).
Decoupling is divided into two categories:

Relative, or weak, decoupling occurs when environmental pressures increase at a slower rate than economic growth; and absolute, or strong, decoupling occurs when the economic variable increases while environmental pressures stagnate or decrease. (Laurent 2011)

Conventionally, economic growth refers to the sustained increase in output measured by the macroeconomic aggregate known as GDP (gross domestic product). This increase has historically required the intensive use of natural resources and energy, leading to environmental degradation through the overexploitation of resources and the accumulation of waste. The interest of decoupling lies in reversing the relationship between production and the use of natural resources and energy. Indeed:

Decoupling occurs when the wealth created (usually measured by gross domestic product) in an economy increases faster than the amount of natural resources used or consumed. The decoupling can be relative or absolute: in the first case, the amount of resources used or consumed continues to increase. In the case of absolute decoupling, the use or consumption of resources decreases while the wealth created increases. (Nicklaus 2017)

The novelty of the service economy model is that the creation of added value is not based on the company's overall turnover, but on the optimal use of the good (Damesin 2013). The objective of the service economy is to improve the well-being of populations.

The concept of decoupling is, however, regarded in some research as a "myth". Tim Jackson's book argues that it is impossible to establish an absolute decoupling between economic growth and the negative externalities generated by resource consumption: relative decoupling can be achieved through technological advances, but absolute decoupling is out of reach, and would even be a dangerous illusion (Laurent 2011). Despite this pessimism, companies tend to aim for relative decoupling, which places less pressure on them.
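The distinction between relative and absolute decoupling quoted from Laurent (2011) can be stated precisely in terms of growth rates. A minimal sketch (the function name and the rate-based formulation are ours, following the definitions above):

```python
def classify_decoupling(gdp_growth: float, resource_growth: float) -> str:
    """Classify the decoupling between economic growth and resource use.

    Growth rates are fractions per period, e.g. 0.03 for +3%.
    Following the definitions quoted from Laurent (2011) and Nicklaus (2017):
    - absolute (strong) decoupling: resource use stagnates or falls while GDP grows;
    - relative (weak) decoupling: resource use still grows, but more slowly than GDP;
    - otherwise, no decoupling.
    """
    if gdp_growth > 0 and resource_growth <= 0:
        return "absolute decoupling"
    if gdp_growth > 0 and resource_growth < gdp_growth:
        return "relative decoupling"
    return "no decoupling"

print(classify_decoupling(0.03, -0.01))  # absolute decoupling
print(classify_decoupling(0.03, 0.01))   # relative decoupling
print(classify_decoupling(0.02, 0.04))   # no decoupling
```

In Jackson's terms, the second case (resource use still rising, only more slowly) is the one technology can plausibly deliver; the first is the one he regards as out of reach.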
The service economy has thus been defined and mobilized according to three referential theories, namely the economy of functionality and cooperation (EFC), Product Service Systems (PSS) and the functional economy (defined in the previous sections) (ADEME 2017). The approaches used to explain the service economy have identified two economic models that break with the industrial model of intensive growth: the "service-based" model and the "life cycle" model.

The service-based model marks the advent of an economic rationale based on the value in use of the good, whose added value is calculated in relation to the sale of the service derived from this good, and not on the basis of the production and sale of the good itself. In this context, ADEME (2017) defines this model as follows:

This logic corresponds to the development of service and customer relations focused on the useful effects and performance of usage of the solution, mainly by valuing the intangible resources on which the company's activity is based (skills, trust, organizational relevance, etc.). It mobilizes beneficiaries, industrialists, local authorities and citizen-consumers in a dynamic of co-production and long-term commitment. Innovation covers all aspects of the business model. This service-oriented logic aims to increase the value created and the quality of the offer, by moving away from the logic of production in volume associated with the reduction of unit costs.

To complete this model, which does not particularly consider the product life cycle, the life cycle model is used. The purpose of this approach is to improve the environmental performance of products:

It allows products to be designed differently by taking their environmental impacts throughout their life cycle into account. Thanks to this new look at products, this approach makes it possible to generate new ideas and be creative. (ADEME 2012)

The novelty of this concept lies in "the technological evolution of the goods that are made available, in this case the extension of the lifespan of goods, and in the logistics put in place to ensure the closure of the physical flows of goods and materials. It requires the company's business model to evolve" (ADEME 2017).
While the two models (service-based and life cycle) seem complementary, they nevertheless present some disparities.

The service-oriented logic aims for performance in the use of the good, not just its availability in the hands of the customer, and it is based on the company's intangible assets:

[This] intangible capital is essentially linked to the company's competencies (professionalization of people, knowledge, know-how, etc.), the relevance of the company's organization and offer (intangible and technological R&D, marketing, relevance of the integration of goods and services), trust between stakeholders (cooperation, reputation, internal and external communication) and the health of workers. (ADEME 2017)

The value of these intangible factors determines the success of a service strategy in the context of the service economy.

In the life cycle logic, the company undertakes to sell the use of the good while remaining its owner. In this case, the company will not adopt a strategy of planned obsolescence. It will be responsible for the maintenance of the assets, particularly if it manufactures them itself and manages them throughout their life cycle:

Extending the life of assets is not the only potential benefit. In theory, a looped management of assets will be facilitated. Manufacturers, retaining ownership of the products, will be able to reuse components and recycle materials to make new products. These practices will be encouraged all the more as the costs and risks of shortages of virgin raw materials are high. (ADEME 2017)

Even if the service economy can be approached through both logics, the life cycle logic is closer to the notion of the circular economy, while the service-oriented logic tends to stand for the service economy as such.

In sum, the sharing economy has been introduced by three theories: the peer-to-peer (P2P) economy, the gift economy and the service economy.
The theoretical foundations of collaborative practices can also be placed in a religious context. Monotheistic religions all agree on the principle of sharing between humans and consider that giving part of one’s goods to the other is an act of faith emanating from a sensitive and virtuous soul.
1.4. Conclusion

The sharing economy is a trend that can change everything in its path, often without being noticed. Based on horizontal management, it encourages sharing and stimulates community relations in business transactions. It increasingly attracts not only an informed population concerned about the impact of economic activity on the environment and on social well-being, but also a young population "connected" to new IT trends. The propensity of this generation to share is neither a fluke nor a coincidence. In short, today's companies, operating in a sharing economy, are on their way to success and sustainability.

TO REMEMBER.– The sharing economy really is an economic model, and it disrupts the capitalist model. The latter has created a crisis that has gone global at the same pace as the globalization of the economy. This crisis is multidimensional: economic, financial, social and ecological. What the sharing economy brings to sustainable development are global and integrated solutions.
Chapter 2. An Opportunity for the Business World
‘Sharing’ is an uncommon usage in the economics literature, though it is common in some of the anthropology literature. I choose it because it is broader in its application than other, more common, but narrower terms for associated phenomena—most importantly, ‘reciprocity’ or ‘gift’. I hesitate to use ‘reciprocity’ because of its focus on more or less directly responsive reciprocated reward and punishment as a mechanism to sustain cooperation in the teeth of the standard assumptions about collective action. Benkler (2004)
2.1. Introduction

The concept of a sharing economy is becoming increasingly important and is steadily being integrated into societies. Although not always explicitly named, it is implicitly practiced, because it is based on social relations and, in most cases, religious ethics. The new production method that characterizes sharing is independent of the price mechanism that has governed business transactions to date, and is more efficient than price-based and government-funded systems (Benkler 2004). The success of sharing practices is attributed to the fact that they are carried out between unrelated individuals (in other words, individuals who do not know each other), who nonetheless form an effective large-scale exchange system (Benkler 2004).

Insofar as the sharing economy is considered an economic model, what are its implications? The scientific debate on the implications of the sharing economy covers issues such as new business models, new challenges in
Sharing Economy and Big Data Analytics, First Edition. Soraya Sedkaoui and Mounia Khelfaoui. © ISTE Ltd 2020. Published by ISTE Ltd and John Wiley & Sons, Inc.
the tax system, changing production models, productivity growth and labor market changes (Grifoni et al. 2018). In response, it can be said that the consequences of the sharing economy relate to social, economic and technological aspects. The sharing economy has the potential to be a factor in socio-economic development. Indeed:

[The sharing economy] also plays an important role, because after several decades of disappointing economic growth and unemployment for hundreds of millions of people, consumers are looking for the lowest possible costs, which the sharing economy enables, and as a result, the number of people willing to participate as suppliers is increasing. Furthermore, technological advances have played a crucial role in supporting the development of sharing platforms and payment processing, and have also provided all parties with sufficient knowledge to carry out transactions. (Institut d'assurance 2017)

The sharing economy already enjoys a good reputation among American consumers. A survey conducted by PwC1 found that 89% of them believe that it is based on trust between the participants in the exchange; 83% believe that it makes life easier; 76% believe that it contributes to protecting the environment; 78% believe that it strengthens the cohesion and strength of the community; and 63% say that collaboration in this economy is more pleasant than dealing with companies (PwC 2015).

In general, a sharing economy approach reduces the ecological footprint and waste of the urban population, allowing them to save money. It also represents opportunities for job creation and business development. It helps to strengthen and encourage social cohesion, social capital and innovation, and to reduce the costs of education and research (Cooper and Timmer 2015). The relevance of the sharing economy lies in the opportunities it offers, without, however, being free of potential risks.
So, what are these opportunities that are likely to appeal to the business world in particular?
1 PricewaterhouseCoopers is a network of companies specializing in auditing missions, accounting and consulting for companies.
2.2. Prosumption: a new sharing economy trend for the consumer

If the sharing economy has established itself among some people, it is because it emerged in the right context. Economic and financial crises, unemployment and environmental degradation are all factors that support the adoption of this new philosophy. Individuals have become inventive in order to escape a difficult socio-economic situation, and today's consumer is no longer the consumer we knew, to such an extent that it is worth wondering whether this is the end of consumerism. It can hardly be denied that the sharing economy is revolutionizing consumer habits and unsettling manufacturers. The "prosumer" phenomenon illustrates the willingness of individuals to shake up consumer habits:

Sharing practices, whether they are commercial or not, make it possible to give a second life to many goods by sharing their use. By promoting use over ownership, these practices have a very high potential for sustainability. (Courtois 2016)

The consumer is no longer a passive party in the producer-consumer relationship:

The rise of new technologies gives him the tools for increased vigilance. It also helps him become a producer of that which he uses, in order to become a prosumer. (Van de Walle et al. 2012)

What is a prosumer? The term "prosumer" combines the words "producer" and "consumer".
Figure 2.1. Prosumer2. For a color version of this figure, see www.iste.co.uk/sedkaoui/economy.zip

2 Source: https://www.linkedin.com/pulse/entenda-o-prosumer-esse-novo-consumidor-jurandirsiqueira.
Historically, the term was first used in 1980 in Alvin Toffler's book The Third Wave. For Toffler, the two functions of production and consumption were not originally separate; it was the industrial company that created a divide between them (Goyette-Côté 2013). In this context, Ritzer and Jurgenson (2010) argue that prosumption has always dominated economies (capitalist and non-capitalist alike). According to both authors, prosumption involves both production and consumption, rather than focusing on one or the other. Although prosumption has always been predominant, "a series of recent social changes, particularly those related to the Internet and the web, have made it even more central".

"Ritzer (1983) deepened the notion of the prosumer by talking about the phenomenon of society's 'McDonaldization'. In his opinion, the fast food industry in the United States began to involve the consumer in the work chain as early as the 1950s. The argument is that consumers have become the waiters (by carrying food and collecting garbage) and their own kitchen assistants (by adding condiments and certain ingredients, or by composing their salads at salad counters). Consumers are happy because they feel the product is more in line with their tastes and because they do not have to wait to be served (not to mention that this also reduces tips). For companies, this is a significant increase in productivity, since the consumer performs tasks that were previously the responsibility of salaried employees."

Box 2.1. The "McDonaldization" of society
For potential consumers, the evolution of this concept represents an opportunity to express their opinions and update their preferences. Several researchers have studied this growing phenomenon. Ritzer and Jurgenson (2010) report that:

It is only recently that prosumption has become an important subject in the literature. Prahalad and Ramaswamy (2004) describe this trend as "co-creation of value". Tapscott and Williams (2006) consider the prosumer as part of the new Wikinomic3 model, where companies put consumers to work. In his book Cult of the Amateur, Andrew Keen (2007) opposes this idea and criticizes this model, as well as any thought process that promotes consumer dependence on the producer. However, Beer and Burrows (2007) see new relationships between production and consumption emerging, particularly on the web.

3 Wikinomics: How Mass Collaboration Changes Everything, a book by Tapscott and Williams, published in 2006, describing an economic system based on massive collaboration and intensive use of technology. It was translated into French in 2007 under the title Wikinomics: Wikipédia, Linux, YouTube, comment l'intelligence collaborative bouleverse l'économie.

2.3. Poverty: a target in the spotlight of the sharing economy

The sharing economy is a way to boost the economy, especially in times of crisis. Rooted in the principles of sustainable development, it aspires to contribute to the Millennium Development Goals, particularly with regard to poverty and inequality between populations. By having the capacity to reduce poverty, and therefore inequalities, the sharing economy represents an opportunity for society. So how can it help shape a new economic model?

Before considering an answer, or answers, to this question, it should be noted that the relationship between the poor or precarious population and the initiatives offered by digital platforms, within the framework of the sharing economy, is not the same as that of the wealthiest class of society. Wouldn't using the sharing economy "risk a bias between public policies based on collective reliability, and social action that is highly linked to charitable practices" (King Baudouin Foundation4 2016)?

The King Baudouin Foundation's study showed that the insecure population does not have access to the range of offers that fall within the scope of the sharing economy, in particular those of the business model, since they are not free. Furthermore, the activities of the sharing economy are carried out with digital tools that cost a significant amount. In addition, this segment of society feels inferior owing to its inability to benefit from this kind of sharing. The objective of the sharing economy is to create an egalitarian society through exchange platforms.
These platforms must guarantee access for all, at lower cost, and for the duration for which a good is needed: a car in town, tools, housing and many other goods that would otherwise be very expensive if they were acquired through the classical economy (Pasquier and Daudigeos 2016). By definition, poverty is a person's inability to satisfy his or her most basic needs: food, shelter, clothing, education, health, possession of comfort goods (car, travel, etc.).

4 The King Baudouin Foundation is an independent and multifaceted foundation, active in Belgium and at European and international levels. It aims to make positive changes in society and to invest in projects or individuals that can inspire others. It was founded in 1976.
Poverty can be understood through two perspectives. "The first focuses on the resources, including goods and services, that are owned or available to characterize the level of poverty"; it is a monetary analysis of poverty. The second focuses on what people are able to do or to be with the resources at their disposal; it is an analysis in terms of "human capabilities" (Lasida et al. 2009).

Box 2.2. Poverty's two perspectives
In light of this definition, one wonders what a population living in poverty can share if it does not have the essentials to meet its most basic needs. Will access to platforms be enough to restore a sense of social reintegration to people living in modest circumstances? Indeed, an asset base is necessary in order to be able to perform any exchange:

In this perspective, economic capital would be measured more in terms of access to property than in terms of heritage. This new relationship with goods is therefore supposed to create a tremendous leverage effect for the poorest populations: what could not be bought yesterday because of a lack of resources can be borrowed or rented the next day at reasonable rates. (Pasquier and Daudigeos 2016)

A study by Williams and Windebank (2005) also showed that people of modest circumstances express negative feelings of social exclusion in relation to second-hand purchases. Paradoxically, they are simultaneously driven by "positive feelings of agency power over purchasing decisions, because second-hand purchases sometimes remain preferable to donations, loans or non-purchases" (Benoît-Moreau et al. 2017). In practice, this means that collaborative platforms offer people opportunities to acquire goods and to feel reintegrated into society through the collaborative economy niche.

But the idea that the sharing economy can reliably institute equality for all is not shared by everyone. The literature on this issue is divided between those who support the idea that digital platforms contribute to social integration, and those who consider them to be reserved for a wealthy population. A 2016 study by Benoît-Moreau, Delacroix and Parguel on the economic benefits brought by purchase-sale practices on collaborative platforms, which for the first time associated a "psychosocial benefits"5 variable with the economic benefits gained, revealed that the economic benefits increase when purchase and sale transactions are carried out on the platforms (they allow users to supplement their budgets and make ends meet at the end of the month). From a psychological point of view, by contrast, this activity does not provide any satisfaction for people in financial difficulty. Instead, it reinforces the feeling of social exclusion, because it is perceived as a "stigmatizing constraint". The only psychological satisfaction is that of making a financial gain (Benoît-Moreau 2017).

Without claiming to be a utopia, collaborative platforms offer economic opportunities for a section of the population in financial difficulty. They give that section of the population a sense of satisfaction by allowing them to perform the act of buying with dignity, instead of begging or getting into debt.

5 Self-image, confidence and a sense of integration.

2.4. Controversies on the economic opportunities of the sharing economy

The sharing economy is a growing phenomenon, and the turnover of sharing platforms is growing rapidly: "the revenues generated by all these players in the European Union have increased from 1 billion dollars in 2013 to 3.6 billion dollars in 2015" (Winkler 2017). While the sharing economy generates profits, it also causes disproportionate effects in society, particularly with regard to employment disruptions. It promotes self-employment, commonly known as freelancing. This new approach to work will fundamentally change labor market regulation:

Activity is rarely regulated by an employment contract. Self-employed workers generally do not benefit from any form of social protection (unemployment, health or retirement) and are therefore, with age or in the event of a downturn in the economy, faced with an increased risk of poverty. (Winkler 2017)

The Internet provides knowledge, information and opportunities worldwide. How can more people benefit from these digital dividends?
– 250 million people in Europe and Central Asia are on the wrong side of the digital divide.
– In Europe and Central Asia, the number of Internet users is higher than the number of bank account holders, but not everyone has equal access to the digital dividends of the Internet.
– The European Union has almost universal access to the Internet, so where are the European Google and Facebook?
– In Central Asia, 60% of people do not have access to the Internet.
– In the South Caucasus, 40% of people do not have access to the Internet.

Box 2.3. Key points on Internet access
Opinions regarding the reliability of the sharing economy as an economic model that can meet the social objectives of sustainable development, particularly those that expect the decrease in poverty and inequality in the world, as stated above, are somewhat harsh. We cannot be categorical about a possible need to increase social inequalities, on the one hand, because the sharing and collaborative economy is a recent concept under construction. On the other hand, the implications of using trading platforms has to be analyzed with greater optimism. Indeed, in the context of trading platforms, pilot projects, which have a purpose of reducing inequalities, have helped to strengthen social cohesion. Projects such as Soli-Food and Welfood, for example, reflect this desire by providing a vulnerable population with access to food under the aegis of the fight against waste (King Baudouin Foundation 2016). The change brought about by the collaborative economy does not affect the nature of the goods consumed, but rather the behavior of consumers, in other words, their way of consuming, either through the B2C or B2B formula. Previous exchanges are now complemented by the emerging C2C formula, but also by other transactional methods that contribute to a more efficient flow of goods in the market. System of aggregated marketing
Transactions initiated by companies:
– B2B. Example: financial leasing of chemicals scheme;
– B2C. Example: bike system managed by the marketer;
– B2G. Example: official car fleet management.
Transactions initiated by consumers:
– C2B. Example: recovery program;
– C2C. Example: classified ads/auction sites;
– C2G. Example: exchange programs for used cars sponsored by the government.
Transactions initiated by the government:
– G2B. Example: high-tech equipment rental;
– G2C. Example: public auction blocks;
– G2G. Example: forestry equipment rental.

Table 2.1. Configuration of new trade methods (Ertz 2017)
An Opportunity for the Business World
Although a number of these exchanges involve new goods (B2B, B2C, B2G, G2B, G2C or G2G), others are only associated with second-hand goods (C2B, C2C, C2G). It is worth noting that C2C exchange schemes can be found in each exchange configuration, making them a dominant factor. In fact, taking marketing into account inevitably implies taking C2C systems into account, because they fill the blind spots that marketers generally pay little or no attention to (C2B, C2C, C2G) (Ertz 2017).

In the context of the collaborative economy, there is an overlap between the production and consumption functions, to the point that the roles of individuals have been reversed. "Exchanges between individuals are brought to the forefront, while organizations act as intermediaries" (Ertz 2017). As a result, the consumer is no longer a mere spectator, but rather an essential player in the business-consumer relationship. They can even do without the company altogether by producing and marketing goods and services themselves.

Transactions initiated by the company:
– B2B (Business to Business): describes business transactions between companies;
– B2C (Business to Customer): describes business transactions between a company and end consumers;
– B2G (Business to Government): describes business transactions between companies and the government.
Transactions initiated by the consumer:
– C2B (Customer to Business): describes business transactions between consumers and businesses;
– C2C (Customer to Customer): describes business transactions between consumers;
– C2G (Customer to Government): describes business transactions between consumers and the government.
Transactions initiated by the government:
– G2B (Government to Business): describes business transactions between the government and businesses;
– G2C (Government to Customer): describes business transactions between the government and consumers; – G2G (Government to Government): describes business transactions between governments. Box 2.4. Meaning of the different transaction methods
Basically, the economic model of sharing is essentially based on the "consumer-to-consumer" (C2C) formula, with platforms serving as intermediaries between individuals (Buda and Lehota 2017). Based on this operating logic, the relations between producer, consumer and government have been redefined, revealing the new exchange methods discussed above. Collaborative consumption embodies the change that has led economic networks to move towards a more local and collective organization, similar to an ecosystem, guided by new patterns of exchange (Ertz et al. 2017).

By embracing the sharing economy, will companies be able to transform today's threat into tomorrow's opportunity? Issues such as these call on business leaders to be players in this new economic model, not just bystanders suffering the consequences of this change. An answer is suggested by the honeycomb representation of the collaborative economy, developed by Jeremiah Owyang (industry analyst and founder of Crowd Companies). It outlines the scope of the sharing economy across the different sectors of the economy. The illustration in the form of a honeycomb structure is not accidental, explains its designer: "The cells are elastic structures that allow the accessing, sharing and improving of resources in a group" (Torfs 2016b). The representation is evolutionary, reflecting the change in company strategies. Currently, we are on its third version; the first version had only six cells at the center of the graph. Several companies have joined the sharing economy movement, and the list of sectors that have integrated the sharing economy into their strategies includes "the finance, logistics, food and transport sectors" (Hallet 2018).
Figure 2.2. Honeycomb representation of the collaborative economy (Owyang 2016). For a color version of this figure, see www.iste.co.uk/sedkaoui/economy.zip
On the third version, Owyang argues that the collaborative economy allows people to get what they need from each other. Similarly, in nature, honeycombs are resilient structures that allow access, sharing and growth of resources within a common group: Our latest version of the Honeycomb Framework, Honeycomb 3.0, shows how the collaborative economy market has grown to include new applications in the areas of reputation and data, worker assistance, mobility services and beauty. (Owyang 2016)
The collaborative economy is no longer a simple fashion phenomenon, nor is it the preserve of a few companies. However, the transition from a traditional to a sharing economy requires a review of all the issues. The French Chamber of Commerce and Industry (CCI 2014) suggests that companies consider the following points:
– The company must take an interest in the stakeholders influencing its internal activity, in other words, suppliers, customers, investors and partners. For example, to improve its relationship with its customers, the company creates a customer club to facilitate the exchange of its used products. At first glance, it will sell less, so what would it gain by creating this club? In the long term, "positive external factors would compensate for the risks" (CCI 2014) and the company would win the loyalty of these customers.
– It must find an alternative to its business model6 in light of the challenges of the collaborative economy. In this case, the company opts for a gradual change in its activity, as opposed to an abrupt one. It integrates social elements "on a small scale, by creating an agile ad hoc structure (in lean start-up mode) in order to experiment with more open models" (see Box 2.5).
– The company must win over some of the customers of start-ups engaged in collaborative economy practices. Companies in the "traditional economy" have underutilized capacities and assets that they can mobilize in a shared or collaborative approach: "It can be a customer base, a local presence, a network of business partners, exclusive suppliers or patented technology."

Lean start-up is a methodology that presents a new way of developing products, with an emphasis on rapid iteration, customer understanding, great vision and great ambition, all at the same time (Ries 2011; Bernet 2014).

Box 2.5. Lean start-up
These initiatives allow companies wishing to transition to the sharing economy to gradually integrate this new economic model into their strategy. They can choose one initiative or another according to the capacity of their resources (human and financial), and must remain alert to all changes affecting their sector of activity.
6 "A business model describes precisely how your company will make money. In practice, this means defining what you are going to sell, to which customers, for what purpose, in what way, and for what benefit. In other words, it is about describing your activity" (source: http://www.entreprendre-ensemble.com/pdf/outils/decrire-son-modele-economique.pdf).
2.5. Conclusion

The sharing economy is gaining more and more ground, and this is not a passing "whim" in the business world. It really does exist and provides solutions to many of the problems caused by the traditional economy, in this case those related to social well-being, particularly poverty and social exclusion. Even if there is some reluctance about the economic consequences of the collaborative economy, an initiative in favor of it can only be beneficial for companies, because it opens up opportunities for them to enhance the value of their products and ensure their sustainability. That is why they must not resist: either they ride this new wave of possibility, or they are swept away by it. It is therefore better to turn these threats into opportunities.

TO REMEMBER.– Opting for an approach in favor of the sharing economy is a great adventure for a company. The protagonists of this new model were motivated by their convictions and their desire to contribute to the sustainability of the community, the economy and the environment. Moreover, activities in support of the sharing economy create value and, in particular, provide employment opportunities. On this last point, those who are "suspicious" of the sharing economy reject the idea that this new economy can preserve and stimulate the world of work. Can we resolve this dilemma? Not for the time being. The sharing economy is a topic in full development; only time will confirm or deny the impact of this new economy.
3 Risks and Issues of the Sharing Economy
The ultimate measure of a man is not where he stands in moments of comfort and convenience, but where he stands at times of challenge and controversy. Martin Luther King
Sharing Economy and Big Data Analytics, First Edition. Soraya Sedkaoui and Mounia Khelfaoui. © ISTE Ltd 2020. Published by ISTE Ltd and John Wiley & Sons, Inc.

3.1. Introduction

The sharing economy is at the heart of scientific, political and social debates. Adopting a spectator attitude and suffering the consequences of this new economic model is a great risk for companies. How can they resist this digital revolution? Indeed, Pascal Terrasse has nicknamed it "the third industrial and digital revolution", adding that no company will escape its effects. The phenomenon of uberization is the main wave to be feared. This concept is a new term derived from the name of the company Uber, which uses digital platforms, specifically for car transport services. This new economic formula is major competition for the traditional economy. The novelty of this phenomenon lies in the change that has particularly affected the supply and distribution channel. The sharing economy is also disruptive, as a small company with modest resources has the ability to destabilize the market of high-profile companies. This is a transitional and decisive phase that presents challenges for companies, which will either be able to jump on the "bandwagon" or will not dare to take the plunge and embark on this digital adventure. So, what are the challenges of the collaborative economy? Is it flexible enough to organize itself alongside other existing or emerging forms of economy?

The collaborative or sharing economy adheres to sustainable development: it relates to the issues of the sustainability of the economy, the achievement of social well-being and the preservation of nature, and must be cross-functional.

3.2. Uberization: a white squall or just a summer breeze?

As an economic model, the sharing economy raises some concerns about the economic and social impacts it generates. It presents risks for competition and for the various stakeholders. The first impact that observers fear is uberization, to such an extent that well-established companies are afraid of waking up one morning to discover that their activities have disappeared (Lévy, cited in Raimbault and Vétois 2017). This fear is legitimate, because it turns out that it is impossible to fight against the various technological, organizational or service-oriented innovations that are commonly called "disruptions" (Raimbault and Vétois 2017). Raimbault adds that this concept is based on Schumpeter's notion of "creative destruction".

"According to Schumpeter's theory of creative destruction, competition leads to the replacement of inefficient outgoing companies by more innovative and efficient ones, and thus to a reallocation of market shares between these two groups of companies. More specifically, several studies use this creative destruction mechanism as a basis for concluding that the dynamics of company entry and exit are an important source of improvement in the aggregate efficiency of sectors and economies."

Box 3.1. Creative destruction (Raies 2012)
What does "uberization" mean? The definition selected here is based on two visions: the advent of Uber's business model and the "disruption of traditional industries through innovation". It should be noted that there are disparities in the conceptualization of uberization, with each author approaching it from his or her own perspective. Some simply consider it to be Uber's business model, others think of it as a disruption of conventional models, and still others retain both elements (Lechien and Tinel 2016).

Uber is a multinational company operating under the "Transportation Network Company" label. Founded in 2009 and based in San Francisco (California, United States), it develops and operates mobile applications that connect
individuals seeking urban transport with self-employed drivers who want to provide this service for a fee. San Francisco is a city known for its highly regulated taxi industry, characterized by high prices and inadequate services. However, the idea was born at the LeWeb conference in Paris, an international event popular with Internet start-ups. Travis Kalanick met Garrett Camp, then owner of StumbleUpon, and discussed the possibility of a reliable and quickly accessible black car service. During an evening of dinner and drinks in Paris, the two men half-joked about the idea of a limousine to transport them safely back to their hotel. Until then, limousines had to be booked in advance; with Uber, on the other hand, one could be available immediately. The service subsequently gained in popularity, as its smartphone application allowed users to access clean and elegant vehicles anytime, anywhere. This early exchange between the founders reminds us of Uber's original slogan: "Everyone's private driver."

Box 3.2. A brief history of Uber (Lechien and Tinel 2016; O'Toole and Matherne 2017)
Generally, the platform business model is based on resources and skills, the value proposition and the revenue model. Uber’s model (see Figure 3.1), which is one aspect of the “uberization” definition, has the same key component, which is the use of platforms.
Figure 3.1. Uber’s business model (Diridollou et al. 2016). For a color version of this figure, see www.iste.co.uk/sedkaoui/economy.zip
As for the characteristics of Uber, Brousseau and Penard (2007) identify three aspects in the existing literature:
– The first aspect: from this perspective, digital networks are mainly understood as markets where supply must meet demand. Carrying out transactions between "suppliers" and "consumers" (of functions) requires specific resources in order to solve a set of transactional difficulties (matching supply and demand, securing transactions, risk management, etc.). From this point of view, platforms rely on their ability to organize transactions between both sides of the "market" in order to reduce costs or make transactions more efficient.
– The second aspect involves the economics of assembly itself and focuses on what can be called "assembly costs". From this point of view, digital networks are generally considered to be production networks in which production resources (the "features") are combined (assembled) in order to produce a result that is useful to users. Alternative ways of combining resources result in trade-offs between assembly cost levels, the ability to meet users' needs and the value offered to consumers as a reward for their efforts.
– The third aspect involves the economics of knowledge management and emphasizes the effectiveness with which information generated by users of digital goods is used to improve services and innovate. From this point of view, digital networks are generally understood as tools for sharing information and knowledge. Alternative ways of doing this affect the effectiveness of the collective accumulation and creation of knowledge, notably by influencing individual incentives to share information with others, the ability of each innovator to retrieve relevant information and the dissemination of knowledge (which is a public good).
"Change before you have to"1 is a quote that resonates particularly in the current context, marked by a major change in the economic model (Jacquet 2017). The advent of new companies without "any reputation" on the market has upset the stability of large companies by putting them to shame. These start-ups have freed themselves from the constraints of the traditional industrial model and have found their way through innovation:

Indeed, Facebook is worth billions and doesn't generate any content, Airbnb is growing everywhere but doesn't own any rooms, Alibaba is the largest retailer in the world yet has no stores. (Jacquet 2017)

1 John Francis "Jack" Welch Jr., born November 19, 1935 in Peabody, Massachusetts, is an American businessman, president of the American group General Electric from 1981 to 2001 and one of the most emblematic business leaders in the United States during the period 1980–2000 (source: Wikipedia).
All these recent companies have taken the opposite direction from the classic business model that developed in the last millennium. They also have one thing in common: "They have caused real strategic 'disruptions' by redefining the way their sector operates" (Jacquet 2017).

3.3. The sharing economy: a disruptive model

Companies are compelled to review their strategies by taking new factors into account, especially those related to digital platform technology. Innovations in this field are "a result and a global and cross-functional process, which is nurtured at all stages by multiple sources" (Schaefer 2014). They have reversed the relationship between producer and consumer, and have changed the very meaning of consumer satisfaction, to the extent that the consumer's main concern is no longer the possession of a good, but rather its use.

This strategic revolution has been accompanied by new concepts, such as "disruption". Literally, "disruption" means a rupture or break with a habit or activity that has been practiced for a long time. This break is attributed to "radical and architectural" innovation (Abernathy and Clark 1993, cited in Benavent 2017). The concept has thus been borrowed by economics to refer to the functioning of certain companies in full expansion after a change in their strategy, such as BlaBlaCar, Airbnb and Uber (Jacquet 2017). It is a transformation of economic systems that had been well established for more than a century, shaken by multi-dimensional innovation, "as was the case with photography, where digital technology broke both a production model based on chemistry and a mass distribution model through the animation of a vast sales and service network" (Benavent 2017). Popularized as "disruptive innovation" by Clayton Christensen et al. (2015), the concept of disruption refers to a process that upends established businesses.
Furthermore:

[It] describes a process by which a small company with fewer resources is able to successfully challenge established companies. More specifically, as incumbents focus on improving their products and services for their most demanding (and usually most profitable) customers, they exceed the needs of some segments and ignore the needs of others. Disruptive entrants start by targeting these neglected segments and gaining a foothold with more appropriate functionality, often
at a lower price. Incumbents, pursuing higher profitability in more demanding segments, tend not to react vigorously. The entrants then move upmarket, providing the performance that mainstream customers demand, while preserving the advantages that drove their early success. When mainstream customers start to adopt the entrants' offerings in volume, disruption has occurred. (Christensen et al. 2015)

Thus, the theory of disruptive innovation has proved to be a powerful way of thinking about innovation-driven growth, and companies that do not sense this change coming will be swept away by the wave of innovation, especially technological innovation.

One of the most consistent patterns in the business world is the inability of large companies to remain at the top of their sector when technologies or markets change. Goodyear and Firestone entered the radial tire market quite late. Xerox let Canon create the small copier market. Bucyrus-Erie allowed Caterpillar and Deere to take over the excavator market. Sears was displaced by Walmart.

Box 3.3. Examples of companies swept away by the wave of disruptive technologies (Bower and Christensen 1995)
Although disruption is induced by the prowess of digital technology, Christensen's definition involves three characteristics, according to Benavent (2017). New entrants, despite their small size, are a source of concern for companies with large resources. Disruption may also stem from the failure of a company's monitoring activity, which fails to anticipate developments and innovations. It can also result from the organizational system, when structural and behavioral change is expensive. Thus, the following question can be asked: to what extent can an innovation in a company become a disruptive innovation? In this context, Christensen wonders whether Uber really is a disruptive innovation. The answer to these questions can be found in the definition of the term uberization, for which there is not yet a consensus; each definition addresses one aspect of this new concept.
Definition by Maurice Lévy, CEO of Publicis “Uberization refers to the fear that dominant companies will be harmed by innovative start-ups that are taking advantage of digital opportunities.”
Definition by Deloitte

Deloitte puts forward a definition of uberization based on seven criteria:
– disruption: traditional models are being challenged; large companies are being threatened by individuals who are disrupting the market in record time;
– usage: the use of a given good or service prevails over the possession of that good or service;
– innovation: new approaches that bring a different perspective on our daily lives and well-being, through the user experience;
– exchange: connecting people who are looking for a product/service with those who have a product/service to offer. This exchange may take the form of a swap, sharing, sale or lease;
– digital: this exchange is supported by digital platforms such as the Internet, mobile, tablets and payment systems;
– disintermediation: the consumer is at the center and the number of intermediaries is reduced to a minimum;
– dynamic pricing: prices adjusted in real time according to supply and demand. Access to the product/service is on demand, when and where the user wants it.
Definition by the uberization observatory “It is the rapid change in the balance of power, thanks to digital technology.”
Definition by Grégoire Leclercq, President of FEDAE, co-founder of the uberization observatory

"Today, uberization is defined according to three pillars: the collaborative economy (which is a revolution in usage), digital innovation (the mastery of technologies and Big Data) and the gig economy (on-demand work)."

Box 3.4. The main definitions of the "uberization" concept (Lechien and Tinel 2016)
The definitions cited above are similar on two levels: the upheaval caused by the advent of digital platforms in the business world, and the tensions in the relationship between established companies and the platforms.
Thus, the following definition of uberization contains the answer to the previous questions:

The uberization of an industry can be defined as a phenomenon with the following two characteristics: the entry of a P2P platform into an existing industry, as well as the disruption of power relations between established companies in that industry and the P2P platform. The uberization of the economy therefore corresponds to the emergence of this phenomenon in more and more sectors. (Lechien and Tinel 2016)

However, Christensen et al. (2015) argue that Uber cannot be regarded as a disruptive innovation, even though it is often considered as such. Their reasoning is that a company's financial and strategic achievements do not, by themselves, make it "disruptive". They put two arguments forward:
– Disruptive innovations originate in low-end or new markets. Disruptive innovations are made possible in two types of markets that are neglected by traditional operators. Incumbents pay less attention to their less demanding customers and focus on the most profitable ones. Disrupters intervene by offering products and services at low prices.
– In the case of new markets, disrupters create a market where none previously existed. Simply put, they find a way to turn non-consumers into consumers. For example, in the early days of photocopying technology, Xerox targeted large companies and charged high prices to provide the performance required by its customers. School librarians, bowling league operators and other small customers, priced out of the market, made do with carbon paper or mimeographs. Then, in the late 1970s, new competitors introduced personal copiers, offering an affordable solution for individuals and small organizations, and a new market was created.
From this relatively modest beginning, manufacturers of personal photocopiers gradually acquired a major position in the mainstream photocopier market. According to these authors, Uber did not initiate either of these dynamics. It is difficult to argue that the company found a low-end opportunity: that would mean that taxi service providers had exceeded the needs of a significant number of customers by making taxis too numerous, too easy to use and too accessible. Nor did Uber primarily target non-consumers, that is, people who found existing alternatives so costly or inconvenient that they used public transport or drove themselves. Uber was launched in San Francisco (a well-served taxi market), and its customers were generally people already in the habit of hiring rides.
Christensen's criticism is methodological; in other words, Uber has taken the opposite path to that of disruptive innovators. Disrupters first attract low-end or unserved consumers and then migrate to the mainstream market. Uber has done the complete opposite: first positioning itself in the mainstream market and then attracting historically neglected segments (Christensen et al. 2015). Uber remains a disruptive company in the broader sense, as it has increased demand by developing a better and cheaper solution to meet its customers' needs. Thus, many companies follow Uber's example and are establishing themselves in the sharing economy; even if they are not disruptive innovations in the strict sense, they still disrupt the stability of large companies.

3.4. Major issues of the sharing economy

The sharing economy is mainly an access economy: sharing is a form of social exchange that occurs between people who know each other, without any profit, as between members of the same family. Once it takes place in a market where a company acts as an intermediary, it is no longer sharing, because consumers pay in order to have access to goods and services; it is an economic exchange in which consumers seek a utilitarian value rather than a social value (Eckhardt and Bardhi 2015). In this case, since the sharing economy is "sharing" in name only, is it not increasingly likely to become just another hegemonic economic model?

Not necessarily. In the research carried out so far, the sharing economy appears to be a natural result of multiple social interconnections and is therefore a major trend, in addition to having several advantages for our society. It meets the environmental, economic and social requirements of society. Apart from rationalizing the use of natural resources and reducing the volume of waste, consumers are adopting a new consumption style. Indeed, the sharing economy frees consumers from the restrictions linked to ownership and from identity-based positioning.
In addition, considering the ease with which goods circulate in such an economy, a country would be less dependent on exports (Ertz et al. 2017). Furthermore, technology has all the assets needed to overcome any obstacles it may encounter. However, it needs political and regulatory support in order to make the transition to this new economic model a reality. Admittedly, the sharing economy appears to be a social matter, but in reality, transactions are regulated by the market. Thus, consumers are driven by the goal of maximizing their satisfaction, and companies by the pursuit of profit.
As a result of this last reflection, the sharing economy raises certain issues. Pascal Terrasse's report on the development of the collaborative economy identifies 19 proposals for developing the sharing economy.

Proposal no. 1: make the conditions for referencing offers more reliable.
Proposal no. 2: make online reviews more reliable by requiring platforms to state that reviews have been verified and, if necessary, to specify the methods by which they are verified.
Proposal no. 3: create a "rating space" for platforms.
Proposal no. 4: ensure clear, readable and accessible information for consumers on:
– the accountability of the platform towards users;
– the status of the user (professional or private) and the guarantees associated with this status.
Proposal no. 5: pursue the convergence of the social protection of the self-employed with that of employees.
Proposal no. 6: mobilize the personal activity account (CPA) in order to establish real portability of rights.
Proposal no. 7: consider periods of activity on the platforms as part of the validation of professional experience (VAE).
Proposal no. 8: clearly define the conditions for breaking off relations with service providers.
Proposal no. 9: develop additional security measures to promote access to housing, secure access to credit and improve users' social security coverage.
Proposal no. 10: organize training activities for service providers.
Proposal no. 11: ensure the contribution of platforms to public charges in France. The French government must continue its determined action, alongside its international partners, in order to eliminate tax structures that allow certain platforms to avoid paying taxes in France.
Proposal no. 12: clarify the doctrine of the tax administration on the distinction between income and cost sharing, and that of the social administration on the notion of professional activity.
Proposal no. 13: address the difficulties of recruiting digital professionals in the collaborative economy sector. Proposal no. 14: engage with the platforms in an attempt to automate tax and social procedures. Proposal no. 15: simplify the entrepreneurial process by allowing platforms to act as trusted third parties. Proposal no. 16: consider the development of the collaborative economy within the context of digital inclusion policies. Proposal no. 17: create an observatory of the collaborative economy. Proposal no. 18: promote experimental collaborative territories. Proposal no. 19: promote the development of home-working and secure the rights and duties of teleworkers. Box 3.5. The 19 proposals on the development of the sharing economy
This report highlights two major recommendations intended for adoption by the public authorities (Acquier et al. 2017):
– the collaborative economy must be subject to the same law as the traditional economy;
– the collaborative economy must be valued, in particular for the benefits it brings to society.

It provides employment opportunities and encourages innovation. Furthermore, collaborative consumption has the power to counter decadent consumerism and provide a framework for sustainability, based on community sharing: "Sharing makes a lot of practical and economic sense for the consumer, the environment and the community" (Acquier et al. 2017). Waste reduction is reflected in the circular economy, which aims to keep products and components at their highest utility throughout their life cycle. For example, in its goal of zero waste, San Francisco's online platform, named Virtual Warehouse, recycles used appliances, office furniture and supplies (Ganapati and Reddick 2018).

Ganapati and Reddick believe that there are four main challenges to the foundations of the sharing economy:
– first, rental could have serious disadvantages, creating new class divisions and more inequalities;
– second, Internet platforms are not necessarily egalitarian. They are themselves giant companies that reduce the benefits of immigrant workers;
– third, the long-term benefits of the sharing economy are not clear;
– fourth, there are security and trust issues related to information sharing.

It is apparent that the sharing economy is facing a wave of challenges that discredit its effectiveness and sustainability. There is also the question of trust between users and start-ups, and between users themselves. Moreover, the sharing economy presents regulatory and legislative challenges.

3.5. Conclusion

Digital technology has fundamentally disrupted people's daily lives and the behavior of economic agents: consumers and producers. Although the consumer has easily gotten used to just clicking on a keyboard to satisfy any need, this is not the case for companies in the traditional economy. The sharing economy, commonly referred to as the "platform economy", involves certain risks for companies that refuse to immerse themselves in this new economic model. Indeed, the phenomenon of uberization offers no respite and is always seeking out commercial transactions, especially given that it has a major asset: "cost reduction".

However, this new economic model faces certain issues and needs to be structured. It must carefully meet the expectations of individuals by protecting them and preserving the natural environment in which they live, through both regulatory and legislative measures. Lastly, to regulate the practices of the sharing economy, it must be equipped with economic policy instruments, in this case fiscal and monetary policies.

TO REMEMBER.– The sharing economy is booming and attracting attention. Its contribution lies in changing the perception of consumption, the consequences of which affect production activity. It is not without risks for industries.
To avoid the full impact of the sharing economy's inconvenient consequences, industries must rise to the challenge, gradually adapt to its rules and take advantage of the new situation, which brings new opportunities.
4 Digital Platforms and the Sharing Mechanism
I used to think that cyberspace was fifty years away. What I thought was fifty years away, was only ten years away. And what I thought was ten years away… it was already here. I just wasn’t aware of it yet. Bruce Sterling
4.1. Introduction

Since the start of the third millennium, the world has experienced a technological revolution under the "influence" of smartphones and tablets. The multi-functionality of these tools allows users to adopt new practices that previously required a fixed tool or the need to travel. These "gadgets" are intelligent multi-purpose devices. They act as digital office assistants, performing some of the functions of a laptop computer. The benefit of these devices is time-saving, in addition to reducing resource consumption by offering the possibility of sharing goods and services, such as car sharing or carpooling.

Uber, Uber Eats and Airbnb are companies that operate in the collaborative economy and convey an innovative, non-traditional message, through a culture that favors use over ownership and sharing over undivided ownership. The technical basis of these technological tools is digital platforms, which contribute massively to the digital economy. They connect service providers and customers, at a cost:
Sharing Economy and Big Data Analytics, First Edition. Soraya Sedkaoui and Mounia Khelfaoui. © ISTE Ltd 2020. Published by ISTE Ltd and John Wiley & Sons, Inc.
The term 'platform' is commonly used to refer to marketplaces, social networks, search engines, collaborative sites and dating sites, comparators, mobile applications and many other players. (Bensoussan and Fabre 2017)

Digital platforms play a key role in the development of the sharing economy, to the extent that it has been nicknamed the "platform economy". However, one does wonder whether this name is justified or whether it is just a metaphor. The sharing economy is characterized by the absence of a hierarchy in its organization: decision-making is collective. Also, in the context of the sharing economy, there is a spirit of cooperation in which the general interest prevails over personal interest. So, what does sharing mean in the context of digital platforms? Will the platform economy succeed in convincing those who remain reluctant and critical of it, who accuse it of thwarting public regulations and of intensifying competition between companies in the sharing economy and the conventional economy?

4.2. Digital platforms: "What growth!"

The term "platform" has several meanings, depending on the field in which it is used. In its literal sense, a platform is a flat surface used to carry objects and/or people. Allegorically speaking, it is a place where thoughts, feelings and perceptions are exchanged. As for the digital platform, it is a virtual environment that allows the management and/or use of applications to run a service.

Interest in platforms is growing rapidly, due to their ability to create value and boost innovation, particularly in industry. The French Digital Council1 describes digital platforms as follows: "Through their role as intermediaries and their place in the digital landscape, they effectively influence relations between users and producers of goods and services." The emphasis here is on the intermediary nature of platforms and how they influence relationships between users. It also means that this description applies to
1 The French Digital Council (CNNum) is a French consultative commission created on April 29, 2011. The CNNum is responsible for studying digital issues, in particular the issues and prospects of the digital transition of society, the economy, organizations, public action and territories. It is placed under the responsibility of the Minister in charge of digital technology.
any digital space that links individuals who are motivated by the same objectives, be they economic, social, scientific, cultural or political.
Figure 4.1. Examples of digital platforms for entertainment and communication services
Additionally, social networks such as Facebook, Twitter, Instagram and WhatsApp represent digital platforms for social use, whose datasets can be analyzed to generate relevant information. The Information and Communications Technology Council (ICTC) defines digital platforms as:

discrete ecosystems of storage units, computer codes and devices that, together, facilitate the public sharing of digital content from one provider to many consumers (Netflix) or from many consumers to many consumers (Twitter/YouTube), using the Internet Protocol.

This definition focuses on the technical side of digital platforms as electronic media, underpinned by the Internet, which simplifies sharing between users, whether between consumers or between suppliers and consumers. There are no restrictions on platform designers and users: developers can be one person, a group of people, or a private or public body. Platforms have become a reality of everyday practices, especially in the economic sphere. Indeed, users are no longer content to exchange virtual information. They conduct their relationships through the exchange of goods and services, and virtual capital, such as Bitcoin, which has a market value.
"Bitcoin is a fully electronic cryptocurrency, created by Satoshi Nakamoto in 2008 and in use since 2009 via digital wallets. Its popularity has increased from year to year, its growth appears exponential, and it seems to have gained credibility with institutional bodies since the beginning of 2017. It is also the most well-known digital currency on the Forex market. Characterized by anonymity, decentralization and a lack of regulation, this new form of payment does not depend on trust in an issuer and does not require intermediaries."

Box 4.1. Bitcoin2
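To illustrate the decentralization principle mentioned in the box, here is a minimal hash-chain sketch, not Bitcoin's actual protocol: the names and amounts are invented, and a real blockchain adds signatures, proof of work and a peer-to-peer network. The point is only that, because each record commits to the previous record's hash, tampering is detectable without a trusted issuer.

```python
import hashlib
import json

def block_hash(block: dict) -> str:
    """Deterministic SHA-256 hash of a block's contents."""
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

# Build a tiny chain: each block stores the previous block's hash.
chain = []
prev = "0" * 64  # conventional "genesis" placeholder
for payment in ["Alice->Bob: 1", "Bob->Carol: 0.5"]:
    block = {"prev": prev, "payment": payment}
    prev = block_hash(block)
    chain.append(block)

# Tampering with the first payment changes its hash,
# breaking the link stored in the next block.
chain[0]["payment"] = "Alice->Bob: 100"
print(block_hash(chain[0]) == chain[1]["prev"])  # False
```

Any participant holding a copy of the chain can recompute the hashes and detect the forgery, which is why no central verifying authority is needed.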
It is staggering: the virtual world, whose origin was the juxtaposition of two digits – zero and one (the binary code) – directs people's daily activities and offers behavioral perspectives on consumption that were still unknown 20 years ago. Who would have imagined that we would one day be making trips in a stranger's vehicle, surrounded by strangers? Or that we would spend a few days living in the apartment of a person who is neither family nor friend?

4.3. Digital platforms or technology at the service of the economy

The world of digital platforms has taken over the economic world. Platforms did not invent the collaborative economy, but organized and harmonized it: collaborative activities already existed in the form of "the recirculation of goods, the decentralized production of goods or services, and the optimization of the use of durable goods" (Lambrecht 2016). Digital platforms are a technical contribution to collaborative economy practices. They are precursors whose aim is to facilitate access to goods and services by saving time and resources. They have given renown to the collaborative economy and have greatly contributed to changing the behavioral style of consumers.

The ability of platforms to disrupt an entire economic system established over decades resides in four characteristics mentioned in the report of the French Digital Council, which make them an enabling environment for the development of economic relations. Digital platforms have set up an information extension system. Verdier and Colin (2015) call this phenomenon the "multitude". Platforms are designed to establish relationships that allow access to common resources between bilateral or

2 Source: https://www.lenetexpert.fr/quest-ce-que-le-bitcoin/?print=pdf.
multilateral parties. Exchanges involve both information and physical goods, thus contributing to the development of social ties and business transactions.

As for the economic model conveyed, the second characteristic of platforms is the personalization of the service for users. It is simply about the "entrepreneurial spirit", which makes it possible to identify possibilities and opportunities in difficult and transitional situations. It enables the creative construction of a new, innovative economic model whose practices and teachings continue over time. This unique form of entrepreneurship lies in the fact that users can pass on knowledge or sell products without having to be familiar with the technology.

The basic principle that binds platform owners and users is the "trust" created between the two parties to the transaction. Platforms not only exchange data or goods and services, they also spread trust between users. Why is trust mentioned in this context? Trust matters wherever relationships involve vulnerability (Lemoine et al. 2017). Indeed, uncertainty always accompanies exchanges on digital platforms, because they take place in a virtual world – far from "utopian" – behind screen interfaces, between people who do not know each other. Establishing trust in a peer-to-peer relationship amounts to instituting rules of conduct, established by the managing companies to reassure the customer. First, these companies ensure the credibility of user information by checking identity and supporting documents. In order to use some of the platform's features, Airbnb requires copies of official identification documents that include name, age, photo and address, such as a national identity card, passport or driver's license. Financial information is also requested, such as bank account or credit card information.
These verification formalities between the company and users are not specific to Airbnb: all digital platforms require them in order to create a climate of trust between the two parties. Trust goes beyond the sharing of personal information; it is implemented through payment practices that are "secure, but also allow for the rapid and efficient management of potential disputes", according to the CNNum. In this context, digital platforms use prepaid virtual accounts, such as PayPal, Stripe and Worldpay.

The use of platforms can likewise be beneficial from a budgetary perspective, since transaction costs are close to zero. As a new economic model, they
have provided a new approach to transaction cost theory. This theory was first introduced by Coase in 1937, and was then developed and shaped into its current form by Oliver Williamson. Basically, transaction costs correspond to the brokerage commission in a purchase or sale transaction. They are classified as ex ante and ex post costs: the former arise in the drafting, negotiation and guarantee of transactions; the latter are associated with organizational, operational and bargaining costs (Lavastre 2001), among other things. However, it should be noted that other costs were identified later in the Coase and Williamson theory.

"When one wishes to carry out a transaction on a market, it is necessary to seek out one's contractor(s), provide them with certain necessary information and set the terms of the contract, conduct negotiations thus establishing a real market, conclude the contract, set up a structure in order to monitor the respective performance of the parties' obligations, etc."

Box 4.2. Coase's transaction cost theory (2005)
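The ex ante / ex post decomposition can be sketched as a simple sum. This is our own toy illustration, not a model taken from Coase or Williamson, and all the numbers are invented; it only makes visible how digital intermediation, by driving both components toward zero, shrinks the total cost of a transaction.

```python
# Toy decomposition of transaction costs (illustrative figures only).

def total_transaction_cost(ex_ante: float, ex_post: float) -> float:
    """Ex ante: search, drafting, negotiation, guarantees.
    Ex post: organization, operation, bargaining (Lavastre 2001)."""
    return ex_ante + ex_post

# A conventional market transaction vs. a platform-mediated one,
# where digital intermediation pushes both components toward zero.
conventional = total_transaction_cost(ex_ante=120.0, ex_post=80.0)
platform = total_transaction_cost(ex_ante=2.0, ex_post=1.0)
print(conventional, platform)  # 200.0 3.0
```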
Indeed, the change that affects the composition of transaction costs accompanies the advent of new economic models. This is the case with the growing use of platforms in the economy, which brings a new perspective on transaction costs. According to the theory initiated by Coase, and then supplemented by Williamson, these costs basically cover the investments in the creation of the company, and the production and marketing of the product. Their curtailment by digital technology requires us to review the original formulation of transaction cost theory.

The mechanism for reducing transaction costs is mainly due to the nature of the inputs (factors of production) in the operation. Data, which constitutes the raw material of the platform economy, is transcribed and transmitted from one user to another without production costs, even if it has a creation cost. According to the CNNum:

Unlike the cost of producing a new audio recording on a vinyl disc, the cost of a digital recording is close to zero.

This steep decline in marginal cost is also the result of an economic model that is increasingly based on the use of assets without ownership. Essentially, the principle of exchange over appropriation is highlighted in the formation of costs, given that goods and services are not part of the users' assets.
Digital platforms also greatly reduce ex ante and ex post costs: the costs of drafting, negotiating, organizing, operating and marketing become insignificant.

The principle of sharing does not apply to all platforms. There are those designed to disseminate information, such as Google, and those that connect people by giving them the opportunity to express themselves and transfer data and knowledge, such as Facebook and Twitter. Other platforms are dedicated to communication, such as WhatsApp and Skype, or to video sharing, such as YouTube. There are also platforms that offer audiovisual entertainment services (film and music), such as Netflix and Deezer. In the financial world, platforms operate online payment systems that support online fund transfers, such as PayPal.

4.4. From the sharing economy to the sharing platform economy

Digital platforms have led to the digitalization of the economy, but also to the digitalization of society. Collaborative platforms, among other digital platforms, embody the ethics of sharing between individuals. They act as a technical intermediary in a transaction (for profit or non-profit) of goods and services between individuals or professionals.

Business transactions, in the context of the sharing economy via digital platforms, are undergoing a remarkable evolution. This is the direct result of individuals' addiction to the Internet, but also a response to the expectations of individuals who yearn for a more egalitarian economy.
France, alongside the United Kingdom, is a leader in the collaborative economy market in Europe, thanks to its favorable regulatory environment. [...] Individuals, the providers of this service, will be the first to benefit from this new economy, with an expected €487 billion, or 85% of total collaborative economy transactions (€570 billion) by 2025. According to projections by PwC experts, four of these five sectors could carry out transactions worth €100 billion a year, with just business services failing to reach this threshold.” Box 4.3. Evolution of the turnover of collaborative platforms (PwC 2016)
Although the platforms were originally designed around a community principle and social relationships, they do not all serve the same purpose. We can therefore classify platforms according to four criteria (Busuttil 2016):
– Criterion based on community and social relationships: an edifying example is that of BlaBlaCar, whose core mission is the fulfillment of social relationships, unlike Airbnb, which is instead moving towards a service-oriented model.
– Criterion that considers the response to social expectations: Heetch is a digital carpooling application that attracted non-professional drivers to work a late-night shift between 8 pm and 6 am. The functioning of this start-up addresses at least three societal issues: opening up disadvantaged neighborhoods, "flexibility and security of travel at night and the end of drink-driving". Heetch was banned in 2017, but its principle remains an example of a platform that meets social mobility requirements alongside night taxis, unlike UberPop, a former Heetch competitor that had no social value.
– Relevance criterion: few trading platforms allow the players involved in the exchange to set their own prices for the goods or services they rent, sell or trade. Applications such as Uber, BlaBlaCar and Ruche have helped to establish good governance with contributors. To this end, Uber has lowered prices by 20%, BlaBlaCar has regulated prices in order to avoid abuse by certain owners, and Ruche, "qui dit Oui", goes one step further by letting suppliers freely set their prices.
– Criterion improving the distribution of social values within the economic model: platforms that are distinguished by their willingness to share economic value within the sharing model (companies and contributors).
Thus, with its slogan "la Ruche qui dit Oui", Ruche is an example of an avant-garde company that embodies a break with the liberal model, unlike platforms like Booking.com, which instead project themselves into an economy that is "ultraliberal, wanting a monopolistic status at all costs and seeking to capture most of the value created" (Busuttil 2016).

The sharing economy is a rapidly changing economic model, whose prospects are not yet well defined. It highlights human qualities such as sharing and solidarity. Above all, it revives the notion of "exchange", which has been disregarded by the capitalist and liberal system. To what extent, then, do digital platforms enhance the notion of "sharing" in the sharing economy?

Digital platforms emphasize the principle of the public interest. The speed with which information is disseminated allows solidarity to spread between individuals and social initiatives, ensuring more effective social protection.
Digital platforms have paved the way for a new economic model, breaking with the economic practices of the 20th Century. The operating principle of these platforms has awakened, or rather stimulated, social ties in commercial relationships. It conveys new attitudes towards the consumption of goods and services, and towards the exploitation of the resources available to humanity. The idea of sharing a car or accommodation, or making a donation to a charity or humanitarian cause, would never have grown so big without the exploits of platform technology.

Sutherland and Jarrahi (2018) argue that the commercial success of social economy companies, as well as the social future of collaborative networks, are often closely associated with the technologies on which they are based. More generally, the sharing economy presents new contexts for the use of technology and for the types of social relationships that are carried out through digital channels. Despite the criticism that some people level at digital platforms – the unfair competition they create and the stable jobs they turn into insecure ones – digital platforms facilitate and encourage the principle of sharing between individuals.

4.5. Conclusion

With the advent of the digital revolution, people's daily lives have been transformed in every way. This new technological paradigm has generated new business models, whose principles have changed the nature of economic activity between agents. Digital platforms are one of the greatest inventions of the digital world. They have given a new economic model renown. They have subtly penetrated the economic world and have taken on different names relating to the economy, such as: platform economy, collaborative economy and sharing economy. Technically speaking, digital platforms represent a virtual space where algorithmic operations are performed, often referred to as "applications dedicated to the execution of a service".
Through digital platforms, the economy is expanding at a staggering rate. They make it possible to provide services and to carry out sharing and collaboration activities between users. Digital platform business models are growing because they have changed commercial "traditions" between consumers and producers, and have invented new ones by removing the distributor from the equation.
Nevertheless, for some researchers, the concept of the platform economy does not reflect the sharing economy, because its vocation is purely commercial and not about sharing.

TO REMEMBER.– The sharing economy has exploited digital platforms. While the extent of sharing activities is insignificant compared to profit-based activities on digital platforms, the latter deserve to be called "sharing economy platforms" because they contribute to conveying a message of solidarity in society.
PART 2
Big Data Analytics at the Service of the Sharing Economy
5 Beyond the Word “Big”: The Changes
“Data! Data! Data!” he cried impatiently. “I can’t make bricks without clay.” Arthur Conan Doyle (1892)
5.1. Introduction

For a long time, companies did not see the importance of data. But now, with the change in the way data is collected and analyzed, companies are coming to use it on a daily basis. The power of data and the potential of analytics, as well as the resulting opportunities, have changed the way companies view data. The way in which data – generated from various sources and in different forms and formats – is collected, analyzed and interpreted has become an intrinsic and essential element of the company's operations, because it is what makes it possible to carry out and operationalize the company's various activities.

Currently, companies are examining data more closely than ever before. Each company, be it large or small, manages a considerable amount of data. Sometimes, they are able to manage this data using various automated analysis tools. However, when data cannot be analyzed by traditional tools, it is time to think about Big Data and analytics.

Big Data – the term you've probably heard on television, read in magazines and newspapers or even heard at conferences and seminars – is more than just a buzzword. This new term is used by managers and decision-makers everywhere, and is the subject of entire conferences.
Several companies have taken advantage of the opportunities offered by this exciting and fascinating field, in order to improve their performance and boost their decision-making process. There is therefore a clear impact of Big Data on business practices. But what lies behind this vague concept? A detailed look at what Big Data analytics encompasses will help you understand its potential and the opportunities that can result from it.

The original concept is, in fact, nothing new, because data has always existed. But the advent of technologies, especially the Internet of Things (IoT), has generated exponential growth in this data. This leads us to say that Big Data is simply a new form of data – like old wine presented in a new bottle.

The first question that arises after reading these few lines is probably related to the changes that have made this phenomenon so important. Why does the term "data" suddenly appear in every conversation, and why are companies so interested in it? In other words, what has changed so much that it justifies such hype? This is what we will discuss in this chapter, so that you can understand the importance of this phenomenon.

5.2. The 3 Vs and much more: volume, variety, velocity

To answer the above questions, you need to be aware that as you read these lines, thousands of tweets have been exchanged, millions of requests have been analyzed by Google, millions of "likes" have been given on Facebook, hundreds of hours of new YouTube videos have been uploaded and several Netflix videos have been launched! In total, in less than a minute of reading, a huge amount of data, in different forms, has been created in real time. Data here, data there; we are witnessing the era of Big Data.

This phenomenon, which seems ambiguous to some, can be understood in different ways. Its understanding varies from person to person, depending on the perspective adopted when examining it (technological, industrial, commercial, etc.).
This is why many people refer to Doug Laney’s 3 Vs, stated in 2001, when looking for a more complete overview. Laney’s model, adopted 10 years later by Gartner, established a definition of Big Data. In other words, there are three properties or characteristics that can help you break down the term, nicknamed “the 3 Vs”: volume, variety and velocity. These
three aspects are essential to understanding how to measure Big Data and how to differentiate it from the forms of data we know (qualitative and quantitative data). These 3 Vs will certainly give you an overview of the scale of the data and how quickly its quantity increases.

While some focus on the development of analytical tools, others stick to modifying the 3V model in order to complete its definition, which is often related to the volume, velocity and variety of data. So, as the Big Data field matured, more Vs were added. Several characteristics, such as value, veracity, etc., were introduced to improve the understanding of Big Data and enrich its depth. A fourth "V" was added a posteriori, referring to the value related to the goals of companies, or the benefits generated by data, in particular by giving it meaning (Sedkaoui and Monino 2016).

In essence, when we talk about Big Data, we are not just talking about the amount of data that can be converted into information. We are also talking about analyzing this data in a way that can generate value. However, to give you a point of origin from which to examine the details of Big Data, we will give an overview of the different characteristics of this phenomenon. This will allow you to see how each characteristic affects the current context, and to understand how true value can be generated from analytics.

5.2.1. Volume

Big Data is a concept that we have regularly come across in recent years. We can even say that it is almost impossible not to have heard the term before. Moreover, a simple search on Google Trends will show you how omnipresent this term has become in conversations around the world. What seems more interesting is to understand what Big Data is. You would probably like to understand this concept, its characteristics, its opportunities and promises, and everything that has made Big Data so "big".
But the most important thing is to know where to start. Think, for example, of a Boeing jet engine, which creates about ten terabytes of data every thirty minutes (10 × 10^12 bytes). Or a Jumbo jet travelling across the Atlantic Ocean, whose four engines can generate approximately 640 terabytes of data (640 × 10^12 bytes) (Rogers 2011). Now, with an average of more than 100,000 flights per day, can you imagine the amount of data produced per day by all the planes in the sky?
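The arithmetic can be sketched in a few lines of Python. This is a back-of-the-envelope estimate only: treating every flight as generating the Jumbo jet's 640 terabytes is our assumption, chosen purely to illustrate the order of magnitude.

```python
# Back-of-the-envelope estimate of daily aircraft data production.
# Per the text: ~10 TB per engine per 30 minutes (Rogers 2011),
# so a four-engine transatlantic flight of ~8 hours -> ~640 TB.
TB_PER_ENGINE_PER_HALF_HOUR = 10
ENGINES = 4
FLIGHT_HOURS = 8  # rough transatlantic duration

tb_per_flight = TB_PER_ENGINE_PER_HALF_HOUR * ENGINES * FLIGHT_HOURS * 2
print(tb_per_flight)  # 640, matching the figure in the text

# Assumption: every one of the ~100,000 daily flights produces this much.
FLIGHTS_PER_DAY = 100_000
tb_per_day = tb_per_flight * FLIGHTS_PER_DAY
print(f"{tb_per_day / 1_000_000:.0f} exabytes per day")  # ~64 EB/day
```

Even if real averages are far lower, the exercise shows why aviation alone pushes data volumes into exabyte territory.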
Facebook, for example, stores photos. This simple statement may not seem impressive to you, right? But once you realize that Facebook registered more than 2 billion users in 2018 (more than the population of China), and that, in total, more than 250 billion photos were stored, this may change your mind. Yes, 250 billion photos is already a lot, but you haven't seen anything yet. In fact, more than 250,000 photos were uploaded every minute on Facebook in the same year. What? You think one minute isn't a lot? Alright, come and take a look at Table 5.1.

What's happening on…                        Volume/minute
Facebook (logins)                           973,000
Google (search queries)                     3.7 million
Play Store & App Store (apps downloaded)    375,000
Twitter (tweets sent)                       481,000
YouTube (videos viewed)                     3.4 million
Netflix (hours watched)                     266,000
Messaging (emails sent)                     187 million

Table 5.1. What happens on the Internet in one minute (2018)
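Scaled up, the per-minute photo figure quoted above grows very quickly; a minimal sketch of the arithmetic:

```python
# Scaling Facebook's quoted photo upload rate from minutes to hours and days.
photos_per_minute = 250_000

photos_per_hour = photos_per_minute * 60
photos_per_day = photos_per_hour * 24

print(f"{photos_per_hour:,} photos/hour")  # 15,000,000
print(f"{photos_per_day:,} photos/day")    # 360,000,000
```

These are the hourly and daily figures discussed in the velocity section below.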
Thus, every minute in the world, more than 700 people take an Uber, 18 million text messages are sent, 2.4 million snaps are created, 1.1 million profiles are swiped on Tinder, 174,000 images are scrolled through on Instagram, $862,823 is spent online, and more (Sedkaoui 2018b). Do you now see what happens in a minute on the Internet? Do you still believe that one minute is not a lot? In that case, you can imagine how many thousands of personal and professional transactions are made every minute. We use applications to measure our sports performance; our cars transmit data, just like planes, trains, etc. Many industrial and commercial processes are controlled by connected devices. This means that data is transferred from countless computers, mobile devices and connected objects around the world. Big Data therefore involves a huge volume of data. This is probably the first thing that comes to mind when we hear the concept of Big Data. 5.2.2. The variety Among all the statistics given in the previous examples, you may have noticed that we talked about photos, tweets, videos, SMS, etc. Various types of data now
Beyond the Word “Big”: The Changes
67
exist, and they are very different from each other. Big Data is therefore much more than just a "huge amount of data". In other words, we cannot understand Big Data without understanding its variety, which is linked to two notions:
– the variety of data types;
– the data structure.
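To make the distinction concrete, here is a small, purely illustrative sketch (the record and its field names are our own): the same piece of information can exist as a structured record, which fits the classic row/column model, or as unstructured free text, which does not:

```python
# Structured: fixed, named fields, ready for a row/column database.
structured_record = {
    "user_id": 42,
    "city": "London",
    "signup_date": "2018-05-01",
}

# Unstructured: free text carrying the same facts, but with no schema.
unstructured_text = "User 42 signed up from London at the start of May 2018."

# A classic database query is trivial on the structured form...
print(structured_record["city"])

# ...but the unstructured form needs text analysis just to find the city.
print("London" in unstructured_text)
```

The second form is the kind of data that dominates in the Big Data universe, and it calls for its own analysis techniques.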
Figure 5.1. Nature and types of data. For a color version of this figure, see www.iste.co.uk/sedkaoui/economy.zip
Yes! Photos, tweets, videos, emails, etc. are also data, but in different forms. Moreover, the data in Big Data is largely unstructured in nature, and cannot be collected, stored and analyzed in the old row/column database format. This means that each form of data requires a specific type of analysis that differs from the others. If you take an e-mail, for example, you will realize that no e-mail's content is identical to that of the thousands, or even millions, of other e-mails sent. Each e-mail has its own sender address, is sent to different destinations at different times, and contains a message (text) and possibly attachments (documents, photos, etc.). This diversity doesn't just involve the variety of forms of data produced, but also the variety of sources from which these data come. This variety, as you can see, is the second property, joining the first one to help you understand how different the nature of data in Big Data is from the data we knew. 5.2.3. Velocity Volume and variety are important, but if you really want to understand the context of Big Data, you must pay particular attention to the velocity aspect, which
is just as important as the two previous ones: data collection, processing and analysis must be carried out in real time. The Internet and the various connected objects considerably accelerate the generation of large amounts of data (Maheshwari 2019), from e-mails to photos shared on social media, the number of videos viewed or downloaded, and of course, the data from geolocation systems. Data can move quickly. Just think of our Facebook example and the 250 billion photos stored last year. Remember that it was clearly indicated that in 2018, every minute, Facebook users posted more than 250,000 photos online. Yes! Every minute. This means that Facebook must manage more than 15 million photos every hour, and must therefore process and classify more than 360 million photos per day to facilitate their retrieval and reuse. In this context, velocity consists of measuring the speed with which data are produced, analyzed and stored. However, it should be noted that there is no absolute rule, such as a number of bytes, that we could use as a threshold to define the velocity of data. 5.2.4. What else? When we hear the words Big Data, we tend to think only of the three characteristics most often used to define this phenomenon, in other words, the 3 Vs: volume, velocity and variety. Its definition is most often based on these characteristics. But while this model is certainly important and correct, it is now time to add other crucial factors. And you know what? They all start with the letter V.
Definitions built on words starting with V have become so classic that practitioners want to explain every aspect of Big Data with a new V, such as:
– value: the benefits of Big Data that can be obtained through appropriate analysis;
– veracity: the reliability of the data source, its context and its importance for the resulting analysis;
– variability: the presence of inconsistencies in the data;
– validity: the extent to which the data are accurate and correct;
– volatility: how long the data remain valid and how long they should be stored;
– visualization: the manner in which the results of data processing (information) are presented, in order to ensure greater clarity;
– viability, vulnerability, and many others.
Thus, there are several characteristics to help us make data useful and generate value. Each characteristic plays an important role in enriching the context of Big Data. All these additional Vs illustrate further challenges in extracting value from unconventional data-sets. Veracity, variability and volatility, for example, refer to the problem that Big Data often consists of data whose accuracy is not guaranteed, irregularities in data-sets, an uneven data flow and complicated navigation between components. Veracity, which often refers to data quality, is an important aspect of Big Data, because not only does data come from everywhere, it also belongs to everyone. Visualizations are also necessary to make sense of the results. With all these characteristics, defining Big Data is not so easy, because the term itself refers to many aspects and new features. Certainly, these features are all important. But the most important thing is to understand how to generate value, which is the key to obtaining useful information and better conducting the decision-making process. This value is only possible by analyzing huge amounts of data (volume) from different sources (variety) in real time (velocity). 5.3. The growth of computing and storage capacities The increasing automation of all types of processing and analysis implies an exponential increase in the volume of data, which is now counted in petabytes, exabytes, zettabytes and yottabytes. Do these units mean nothing to you? Well, it is thanks to them that the amount of data produced is measured. A zettabyte, for example, refers to 10^21 bytes, in other words, 1,000,000,000,000,000,000,000 bytes. Is that a lot? Are you able to read that number? Can you imagine the computing and storage power needed to process such a large volume of data?
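A short sketch of the decimal unit ladder makes these orders of magnitude easier to grasp:

```python
# The ladder of decimal byte units, from kilobyte to yottabyte.
units = ["kilo", "mega", "giga", "tera", "peta", "exa", "zetta", "yotta"]

for i, name in enumerate(units, start=1):
    print(f"1 {name}byte = 10^{3 * i} bytes")

zettabyte = 10**21
print(f"A zettabyte written out: {zettabyte:,} bytes")
```

Each step up the ladder multiplies the previous unit by a thousand, which is why the zettabyte is written with 21 zeros.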
Thus, be aware that this huge figure only represents a fraction of the amount of data produced each year around the world. Another thing you should know is that, in one way or another, everyone is a producer and consumer of data today. You may ask yourself: but how is that possible? Just look at all these connected objects (smartphones, computers, tablets, smart watches, GPS, smart cars, etc.) and the various available applications on which we have become heavily dependent.
The Internet of Things (IoT) also contributes to the increasing data size (Gartner 2017), and has led to a rise in the number of applications based on artificial intelligence (AI) and Machine Learning. These connected objects and applications also generate data, which puts growing pressure on the computing and storage capacity available for data flows. The 3 Vs mentioned above are a challenge, because (Sedkaoui and Gottinger 2017):
– volume strains the storage, memory and computing capacity of an IT system and requires access to a cloud;
– velocity emphasizes the speed at which data can be absorbed and significant responses produced;
– variety makes it difficult to develop algorithms and tools able to process this wide range of input data.
The first two Vs are important from an IT perspective (storage and processing), whereas the last V is important from an analytical perspective (Sedkaoui 2018b). As a result, IT systems must allow for the storage, analysis and extraction of relevant knowledge. It is no longer about the word "big", but about how this large volume of structured, semi-structured and unstructured data must be captured, stored and analyzed in order to generate "value". The differentiating factor in today's business is not having or collecting data, but the power to analyze it, transform it into information and extract knowledge from it (see the knowledge pyramid, Ackoff (1989)). So, with the ever-increasing volume, variety and velocity of data, it was necessary to reconsider its storage and processing in order to continue to extract useful information. 5.3.1. Big Data versus Big Computing Being able to process and analyze large volumes of data, available in different formats and coming from different sources, is another answer to the previous questions. This is mainly due to the increase in computing power.
Big Data therefore refers to data-sets that are so large and complex that traditional data processing tools cannot support them.
Understanding the 3 Vs – volume, variety and velocity – is an essential element in understanding the Big Data universe, but it is far from the whole story. We agree that when data is produced in real time and arrives in a continuous flow from multiple sources, it qualifies as Big Data. However, to transform this quantity into value, we need a large amount of computing power, at a lower cost, to examine and process this data. So "big" is, of course, a term relating to the volume, variety and velocity of data, but more importantly, it also relates to the IT infrastructure that is in place, because this "big" volume also calls for large-scale analyses. The principles of statistics, forecasting, modeling and optimization remain the same. It is this computing capacity that has the potential to monetize data. Today, we can run billions of simulations thanks to the advancements and progress of IT tools, which tend to focus on Big Data technologies. There are now several tools available for solving problems, determining models and identifying opportunities. In the context of Big Data, the analysis process is highly dependent on the technology adopted. This is not only valid in this context, but in all areas and activities. There are many Big Data technologies available to implement a data analysis process. Usually, the data science community uses one of the following two programming languages:
– R, which refers to a statistical environment and facilitates the manipulation of mathematical functions and the graphical representation of results;
– Python, for its simplicity and the availability of libraries that implement the most widely used algorithms. It should be noted that we will use this language to analyze the databases discussed in this book.
Also, if you want to join the Analytics arena, you need to master technologies like Hadoop, MapReduce, HBase, Spark, etc.
Box 5.1. Examples of Big Data technologies used
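To give a feel for how frameworks such as Hadoop split up work, here is a minimal, purely illustrative pure-Python sketch of the MapReduce pattern (the sample documents and function names are our own; real Hadoop or Spark jobs distribute these same steps across a cluster):

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in a document.
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    # Shuffle: group all emitted counts by key (the word).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word.
    return {key: sum(values) for key, values in groups.items()}

documents = ["big data big analytics", "big value from data"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts["big"], counts["data"])  # prints: 3 2
```

The value of the pattern is that the map and reduce steps are independent per key, so they can run in parallel on many machines over very large data-sets.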
Thus, the growth in computing power is the true change that has opened the door to great opportunities. 5.3.2. Big Data storage Nowadays, companies focus so much on data analysis and processing that they often forget the need for data storage solutions. They are mainly interested in how they will be able to transform all the data collected into value.
However, the data accumulated does need a space where it will be stored. This means an infrastructure that allows the storage of large amounts of data. In 2011, the McKinsey Institute proposed its own version of the term Big Data. This term is defined by the institute as: Data whose scale, diversity and temporal distribution require new technical architectures and more advanced analyses in order to extract knowledge that represents a new source of value. (Manyika et al. 2011) If you look at the long number in a single zettabyte, you will realize that growth lies not only in computing capacity, but also in storage capacity (Table 5.2).

Year    Storage capacity
1992    100 gigabytes/day
1997    100 gigabytes/hour
2002    100 gigabytes/second
2013    28,875 gigabytes/second
2018    50,000 gigabytes/second

Table 5.2. The growth of storage capacity
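Converting each row of Table 5.2 to a common per-second rate shows how steep the growth factors between successive entries are; a small sketch:

```python
# Converting each row of Table 5.2 to gigabytes per second,
# then computing the growth factor between successive entries.
SECONDS_PER_DAY, SECONDS_PER_HOUR = 86_400, 3_600

rates_gb_per_s = {
    1992: 100 / SECONDS_PER_DAY,   # 100 GB/day
    1997: 100 / SECONDS_PER_HOUR,  # 100 GB/hour
    2002: 100.0,                   # 100 GB/second
    2013: 28_875.0,
    2018: 50_000.0,
}

years = sorted(rates_gb_per_s)
for prev, curr in zip(years, years[1:]):
    factor = rates_gb_per_s[curr] / rates_gb_per_s[prev]
    print(f"{prev} -> {curr}: x{factor:,.2f}")
```

The jump from a daily to an hourly rate is a factor of 24, and from an hourly to a per-second rate a factor of 3,600, which is why the later absolute figures dwarf the earlier ones.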
Technology is constantly evolving and, as a result, machines and connected objects are consuming more and more data. Table 5.2 shows that in 2018, 50,000 × 10^9 bytes of data were created per second, which amounts to several exabytes created daily. The volume of data will only continue to grow very significantly. It should also be noted that the quantity mentioned in the previous definition may vary from one company to another (small or large) and from one business sector to another (trade, industry, services). Of course, it can also vary depending on the analytical technologies used and the size of the databases available in a particular company or sector. This is why, in many companies or sectors, the amount of data ranges from a few dozen terabytes to several petabytes, a petabyte representing a thousand terabytes (10^15 bytes). Big Data generally denotes data-sets whose volume exceeds the capacity of the IT tools commonly used to capture, manage and process them in real time. It is therefore essential that large data storage systems evolve.
5.3.3. Updating Moore's Law A simple computation over all the data on our physical activities that is recorded through connected objects (smartphones, smart watches, etc.), such as our heart rates, can not only save our lives, but also those of thousands of other people. This is just one simple example of the power of data. The data generated by a person who jogs for 20 to 30 minutes in one day may seem like little to you. But imagine the amount of data generated by recording the pulses of a billion people who own smart watches or smartphone applications. Now imagine the quantity produced per week, per month and per year. Every share or comment on Facebook or Instagram, every video you watch on YouTube, is data. With billions of subscribers, data will continue to grow exponentially. This exponential growth was predicted more than 50 years ago by Gordon Moore. In his 1965 article "Cramming More Components onto Integrated Circuits", Gordon Moore indicated that: "The number of transistors per circuit of the same size doubled every eighteen months, with no increase in costs" (Moore 1965). In this article, Moore noticed that the number of transistors on a chip had almost doubled each year from 1959 to 1965, and stated Moore's law for the first time, according to which the number of transistors on a chip would roughly double every two years. Table 5.2 suggests that data creation is growing just as exponentially, as if hard disks were applying their own version of Moore's law. Indeed, computers and the various connected objects of today are much more powerful, and much smaller, than those used 10 or 20 years ago. This law certainly has an effect on data, because placing sensors and smart interconnected objects everywhere increases their dissemination. And the bigger this dissemination is, the more the quantity of data produced by these objects increases.
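Stated this way, the law is easy to turn into a formula; the starting count of 1,000 transistors below is an arbitrary illustration, not a historical figure:

```python
# Moore's law as stated here: counts roughly double every two years.
def moore_growth(start_count, years, doubling_period=2):
    """Projected transistor count after `years` years of doubling."""
    return start_count * 2 ** (years / doubling_period)

# Two decades of doubling every two years is a 2^10 = 1,024-fold increase.
print(moore_growth(1_000, 20))  # 1,000 x 1,024 = 1,024,000
```

The exponential term is what makes the growth so dramatic: each doubling period multiplies the count, rather than adding to it.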
As the amount of data increases, so does the need to improve computing and storage capacity at a low cost. And the lower the price, the more the dissemination of smart objects and sensors increases, and so on. Do you see? It is a self-reinforcing loop, one that drives the continued use and improvement of machine capacities. We can then say that Big Data pushes the limits of Moore's law. Big Data has brought new ways of doing things and new methods of collecting, analyzing, integrating and visualizing data.
As processing power and storage capacity increase, it will be necessary to explore new fields of computing and rely on a number of essential resources, including the cloud, which is one of the main means by which computing can continue to grow. Cloud computing, which has attracted considerable attention in recent years, is seen as an infrastructure that has revolutionized data storage and computing capabilities. These capabilities have led to advances in computing that have improved the power of processing and analysis, enabling companies to be more efficient and to introduce innovative solutions. In parallel with these advances, Big Data technologies, which rely heavily on cloud platforms for storing and processing data flows, have been widely developed. They are among the technologies most frequently used in the development of applications and services in various sectors (health, education, energy, web, etc.). 5.4. Business context change in the era of Big Data Big Data and its various applications have significantly changed the business playground. Technological advancements in analytical tools and algorithms have unlocked the operational potential of both large and small companies, and have had a significant impact on their various activities. As the collection, analysis and interpretation of data become more readily available, they will have a significant impact on every company in several important ways, regardless of its business sector or size. Awareness of the importance and potential of data for companies has therefore been raised. Now it is time for companies to think about how to make the most of it. Whether for the operationalization of projects (revenue growth, cost reduction, etc.), the optimization of different activities, the creation of new services or the improvement of the decision-making process, Big Data analytics has transformed the way companies act.
As Kenneth Cukier, Data Editor of The Economist in London and co-author of Big Data: A Revolution That Will Transform How We Live, Work, and Think, published in 2013, points out: "More isn't just more. More is new. More is better. More is different." Yes! Because more data means new methods of analysis, new ways to glean useful information, more efficient decision-making and more effective operationalization of business strategies and perspectives. Therefore, with more data, companies become more creative.
According to Frizzo-Barker et al. (2016), this phenomenon could potentially change the way companies think about data infrastructure, business intelligence and analysis, and information system strategy (Sedkaoui 2018a). By using its different applications, companies will be able to improve their decision-making process and, consequently, their performance (McAfee and Brynjolfsson 2011). Big Data is becoming a trend and a lever that every company must consider in its culture and in the design of its business model. Its various applications have changed the way companies operate and have created new opportunities for growth. By leveraging this data, companies can identify new opportunities, create more efficient operations, increase profitability and improve their service. Companies that exploit all available opportunities will not only gain a competitive edge, they will also transform their business models and industries by stimulating growth in new sectors using new technologies. So, just as Cukier thought, more doesn't simply mean more, because having more data has completely changed the game. Having more data has allowed companies to operate differently and to see opportunities that were not previously visible. Having more data has led us into a completely different situation. Big Data has therefore changed the business paradigm and will continue to influence its context, which in turn will undoubtedly affect society as a whole. However, the real value of this paradigm is only achieved when companies exploit the full range of opportunities offered by each byte of data. In other words, this is only possible when everything that is said is actually done. 5.4.1.
The decision-making process and the dynamics of value creation The different Vs of Big Data challenge the fundamental principles of existing technical approaches and require, as already mentioned, new forms of processing that promote decision-making, knowledge extraction and operations optimization (Curry 2016; Sedkaoui 2018b). It is well known that the decision-making process is often based on the model of limited rationality, with its phases of intelligence, design, choice and review, forged by Herbert Simon in 1977. With the huge amounts of data produced every second, this model becomes complex and needs to be improved (Sedkaoui 2018a). Indeed, decision-making, strategy development and the anticipation of change have always been dependent on the availability and quality of data. Information technology experts create algorithms to better manipulate and organize data. These experts are supported by data science experts who are responsible for the initiation, development and application of quantitative methodology.
They collect, store and analyze data using software and programs, applying the necessary analyses (data analysis algorithms) to generate models or information that subject-matter experts, or decision-makers, interpret in order to make strategic decisions and create value. All these steps form part of a data value chain, from collection to decision-making, with, of course, the contribution of the various stakeholders, supported by the technologies used. The value chain therefore depends on the quantity and quality of the data submitted for analysis. The value chain categorizes a company's generic value-adding activities, allowing them to be better optimized. A value chain is composed of a series of subsystems, each of which includes inputs, transformation processes and outputs. In traditional models, the key value creation activities, which can be described using the value chain developed by Michael E. Porter in 1985, view data as a support rather than as a source of value. Box 5.2. Data in Porter's value chain
The decision-making process depends on the process of creating meaning, in other words, the process of creating knowledge. In this respect, decisions are data-driven, that is to say, they involve an important analytical phase based on the processing of structured or unstructured data, be it from internal (company databases) or external (web, social networks, etc.) sources. A data-driven decision-making process is therefore a process that includes an analytics aspect. This aspect can help to identify decision-making opportunities in the intelligence phase of Simon's model, where the term "intelligence" refers to the discovery and extraction of knowledge. In this particular case, this phase consists of identifying the opportunities for which a decision must be made (Simon 1997). From the decision-maker's perspective, the importance of the data analysis process lies in its ability to provide valuable information on which to base decisions. It allows decision-makers to take advantage of opportunities arising from the wealth of data generated by supply chains, production processes, customer behavior, etc. (Frankel and Reid 2008). This process includes several distinct phases, which we will discuss in more detail in the next chapter.
5.4.2. The emergence of new data-driven business models When technologies become cheaper and easier to use, they transform businesses. This is the case with Big Data technologies, which have brought a substantial reduction in data processing and storage costs. The range of opportunities offered by these technologies allows companies to transform their business model, through new visions of the value chain, into a vast, fully digitalized ecosystem. This involves new ways of operating, and has led to a reassessment of the foundations of business and an interpretation of new ways of creating value. This is what Sergey Brin and Larry Page of Google, Jeff Bezos of Amazon, Jack Ma of Alibaba.com, Travis Kalanick and Garrett Camp of Uber, Reed Hastings of Netflix, Brian Chesky, Joe Gebbia and Nathan Blecharczyk of Airbnb, Frédéric Mazzella of BlaBlaCar, and many others, have done. You may be wondering how they did it. The answer is simple: they took full advantage of the opportunities of Big Data analytics by deciphering the underlying messages revealed by each byte of data. Of course, these innovative models have also understood the importance of the digital ecosystem and digital platforms. Their success depends on their responsiveness to change and their ability to monetize it in their "ROI" (Return on Investment). They have exploited the potential of analytics not just to differentiate their business models, but also to innovate. And innovating in a business model means exploring new ideas, creating new value propositions and setting up new value chains. It is therefore a question of innovating, but in a different manner. As Drucker (1994) indicated, it is about how to make a difference. This is what the business model concept itself means. The concept refers to the approach that companies adopt in the face of competition and change, based on the skills and resources available (McKelvey and Zhu 2013). The business model therefore describes how a company creates, delivers and captures value.
Value creation, for the benefit of all stakeholders in the business context, is the main element of a company's success. However, in an ecosystem of generalized digitization, be aware that evaluating and evolving a business model is no longer optional, because everything changes quickly and only the most agile companies will endure. And when a company makes a strategic decision to develop its project, data is always useful. The experiences cited above best illustrate this point. If these companies, among others, have managed to surprise us, it is because they set up a series of innovative
business models oriented towards, and based on, data and analytics: what is called the data-driven business model. Creating a model that is oriented and driven by data means making decisions based on the analysis of that data. For every company or start-up that designs and implements a data-driven business model of the kind that has changed the business landscape, there are several experiences that can serve as a reference for capturing new sources of revenue. 5.5. Conclusion We hope you now have a better understanding of the subject of this chapter: what has changed so much with Big Data? The purpose of this chapter was to provide an overview of this change. The phenomenon brings together a set of challenges concerning the exploitation of data, which is continuously produced at new scales of size, form and complexity. That is why, when you hear the term Big Data, you often hear the term "analysis" introduced in the same sentence. In this context, companies must make the most of this extensive landscape of data: applying multiple technologies wisely, carefully selecting the key data for specific investigations, and innovatively adapting large integrated data-sets to support specific queries and analyses. As a result, Big Data analytics has become a major trend in the world of technology and business, and a great challenge for companies. Before going into analytics, which will be detailed in the next chapter, let us summarize what we have understood through the different points developed in this chapter. TO REMEMBER.– This chapter has taught you that:
– Big Data is a generic term for any collection of data that is so large or complex that it becomes difficult to process using traditional data management techniques;
– Big Data is often characterized by the 3 Vs:
- V for volume (the size of the data): how much;
- V for variety (the type of data): how diverse;
- V for velocity (the speed of the data): how fast;
– often, these characteristics are complemented by other Vs, namely: value, veracity, etc.; – growth in computing power and storage capacity are two other aspects that quickly change as the amount of data continues to increase; – in the Big Data universe, you will encounter different types of data, and each type generally requires different processing and storage tools and techniques; – given its importance, many companies are investing in Big Data in order to better understand their customers, boost their processes and differentiate their business models.
6 The Art of Analytics
The secret of getting ahead is getting started. The secret of getting started is breaking your complex overwhelming tasks into small manageable tasks, and starting on the first one. Mark Twain
6.1. Introduction Through the previous chapter, you have understood that Big Data has marked a major turning point in the use of data and is a powerful vector for growth and profitability. A complete understanding of a company’s data and potential is, without a doubt, a new vector of performance. For a company to successfully use Big Data, it is essential to obtain the right information and sources of knowledge, so as to ultimately help the company make more informed decisions and achieve better results. But to achieve this, a company needs to go beyond simply answering the question about amounts of data, because Big Data isn’t just about collecting data, but also about using it effectively. Data is therefore a form of wealth that we cannot achieve without understanding and mastering the process that leads us to this wealth. This process, which gives the company the power to gain a competitive advantage, is called analysis. This is why when you run into the term “Big Data”, you often hear the term “Analytics” in the same sentence.
Sharing Economy and Big Data Analytics, First Edition. Soraya Sedkaoui and Mounia Khelfaoui. © ISTE Ltd 2020. Published by ISTE Ltd and John Wiley & Sons, Inc.
In the Big Data universe, data and analysis are totally interdependent: one without the other is practically useless. In other words, there is no data without analysis and no analysis without data. The process we apply to data is the means by which many products and services can be created. To generate value, as previously discussed, we need a stack of tools and techniques that can form a framework for extracting useful features from large volumes of data. This includes:
– the identification of relevant data;
– the definition and choice of methods able to detect correlations and separate the signal from the noise, in order to establish clear guidelines and know what to use or not to use.
The aim is to make operations and various functionalities more agile, and to apply them to faster (real-time) and more strategic data-based decisions, rather than simply looking in the rearview mirror while driving the car. Understanding this process is therefore essential, so that you can follow the different examples of algorithmic applications that will be discussed later in this book (Part 3), and in turn see the opportunities of Big Data analytics. Thus, before explaining how to use large data-sets in the context of the sharing economy, we will identify the different phases of the data analysis process. This chapter will serve as an instruction manual to help you understand this process. 6.2. From simple analysis to Big Data analytics In his book Big Data at Work: Dispelling the Myths, Uncovering the Opportunities (2014), Thomas H. Davenport stated that we have moved beyond the time when data was analyzed for weeks or even months, ultimately producing results for users we never met. Nowadays, a new term has been coined, the Internet of Things (IoT), in which everything is connected and produces large volumes of data in different forms and in real-time. The amount of data has increased and, as a result, a different need has quickly emerged.
Hence the importance of Big Data analytics, which is not in itself a new trend. Data analysis has mattered to companies since their first use of information technology in the late 1960s, when experts began to focus on databases and their evolution. By analyzing these databases, companies have always tried to understand their customers' behavior and their general context.
The Art of Analytics
However, with the emergence of the 3Vs phenomenon, a new form of analysis has appeared, with different types, methods and stages. Traditional tools cannot manage the amounts of data that arrive in a continuous flow, in various forms and from different sources. This data has become so large and important that several tools and methods have been developed to analyze and leverage it. What used to take hours, days, weeks or even months with traditional analysis can now be processed in seconds, or in real-time.

Davenport and Dyché (2013) identified three eras of analytics: from "Analytics 1.0", which began in the 1950s, to "Analytics 2.0", which appeared when Big Data was first introduced (early 2000s), to "Analytics 3.0", which refers to the current situation and can be described as a combination of Analytics 1.0 and Analytics 2.0. The main differences between version 1.0 and version 3.0 are:
– the use of external data;
– the use of unstructured data;
– the use of prescriptive analysis.

Box 6.1. Analytics: from version 1.0 to version 3.0
In the Big Data universe, what distinguishes a company from its competitors is the ability to identify the type of analysis that can be exploited optimally to its advantage. Many companies do not know where to start, or which type of analysis may be most favorable. Data analysis can be divided into three distinct types: descriptive analysis, predictive analysis and prescriptive analysis. At first glance, you can easily distinguish between the first and second types; their names say as much: "describe" and "predict". But what about the third? Simply put, prescriptive analysis refers to the type of analysis that determines the actions necessary to achieve a specific objective. These three types of analysis are the most common, and are interdependent methods that allow the company to better manage the amount of data it has. Some companies only use descriptive analysis to inform the decisions they face, while others may use a combination of analysis types in order to obtain useful information for planning and decision-making.
84
Sharing Economy and Big Data Analytics
In the next section, we explore these three types of analysis in detail, referring to an example so that you can see what each type brings to improving a company's operational capabilities.

6.2.1. Descriptive analysis: learning from past behavior to influence future outcomes

In its simplest form, data analysis involves a form of descriptive analysis (Delen and Demirkan 2013). Descriptive analysis is simply the analysis of historical (past) data of an event, in order to understand and assess its trends over time. It involves the use of analytical techniques to locate relevant data and identify notable trends, in order to better describe and understand what is happening in the data-set (Sedkaoui 2018b). These techniques are therefore used to describe the past, which can refer to any time at which an event has occurred. Consider basic calculations such as sums, means, percentages, etc., as well as graphical presentations. This analysis is based on standard database functions, which only require a knowledge of elementary calculations.

The most important thing is that the company gains a clear view of how an event (production, operations, sales, stock, customers, etc.) has behaved in the past, in order to know how it should act. This provides a clear explanation of the behavior of an event and why certain outcomes occurred. The objective is to find the reason for success or failure. What makes this type useful is the fact that lessons can be drawn from an event, based on its past behavior, which helps us understand how it can influence the future. Of course, by using the right set of tools, the company can learn powerful lessons. This type of analysis is of great importance and feeds the predictive models that are considered a subset of data science (Waller and Fawcett 2013; Hazen et al. 2014).

6.2.2. Predictive analysis: analyzing data to predict future outcomes

With the increasing amount of data, the improvement of computing power, the development of Machine Learning algorithms and the use of advanced analysis tools, many companies can now use predictive analysis. This analysis goes beyond the previous one: it aims to predict future trends. It should be noted here that the word "predict" is not synonymous with stating what will really happen in the future. This cannot be the case, and no analysis or algorithm can
do it flawlessly. Indeed, predictive analysis can only estimate what may happen in the future: it is founded on probabilities. By providing usable information, predictive analytics allows companies to anticipate the future.

This type of analysis involves predicting future events and behaviors in data-sets, based on a model constructed from similar previous data (Nyce 2007; Shmueli and Koppius 2011). There is a wide range of applications in different fields, such as finance, education, health and law (SAS 2017; Sedkaoui 2018a). From analyzing sales trends based on customers' buying habits (to recommend personalized goods or services), to forecasting demand for operations, to determining risk profiles for finance, to analyzing feelings (otherwise known as sentiment analysis), the scope of this analysis is vast and varied.

A sentiment analysis consists of studying a text in order to assess the tendency of the emotions it conveys. The objective is to determine whether the text indicates a positive, negative or neutral feeling. This is usually measured by scoring a piece of text between (-1) and (+1), with the "+" side indicating a positive feeling and vice versa.

Box 6.2. Sentiment analysis: a common type of predictive analysis
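As a concrete illustration of the scoring idea in Box 6.2, here is a minimal lexicon-based scorer. The word lists and the scoring rule are invented for the example; real sentiment analysis relies on much richer lexicons or trained models.

```python
# A minimal lexicon-based sentiment scorer, sketching the idea in Box 6.2.
# The word sets below are illustrative assumptions, not a real lexicon.
POSITIVE = {"good", "great", "excellent", "love", "happy"}
NEGATIVE = {"bad", "poor", "terrible", "hate", "sad"}

def sentiment_score(text: str) -> float:
    """Return a score in [-1, +1]; positive values indicate a positive feeling."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos + neg == 0:
        return 0.0  # neutral: no opinion words found
    return (pos - neg) / (pos + neg)

print(sentiment_score("great service and excellent app"))  # 1.0
print(sentiment_score("terrible update, I hate it"))       # -1.0
```

A text with no opinion words scores 0.0, matching the "neutral" case described in the box.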
Predictive analysis can be considered one of the methods most commonly used by companies to analyze large data-sets and answer the following question: what could happen, based on previous trends or patterns? To answer it, predictive analysis combines all the data, fills in missing values with historical data from company databases, and looks for models that identify the relationships between the different variables. In this case, it should be mentioned that the amount of data available is not a problem; the richness of the data, however, is often questionable (Sedkaoui 2018b). This matters when people want to perform a prescriptive analysis.

6.2.3. Prescriptive analysis: recommending one or more action plan(s)

Delen and Demirkan (2013) found that Big Data has introduced the possibility of a third type of analysis, called "prescriptive analysis", which goes beyond the previous analyses by recommending one or more action plans to be undertaken. Just as the two types of analysis described above are closely related, prescriptive analysis is closely linked to predictive analysis. Prescriptive analysis uses evidence-based predictions to inform and suggest a set of actions, because in order to prescribe a series of actions, it is first necessary to anticipate the future situation.

This analysis not only allows companies to consider the future of their own processes and opportunities, but also to determine the best course of action to take in order to achieve competitive advantages. This type of analysis is therefore based more on determining what actions need to be taken to optimize certain results than on what can happen if the company continues to do the same. It is an advanced analysis concept based on optimizing analysis results and identifying the best possible action plans. It uses a combination of Machine Learning techniques, tools and algorithms, and thus requires a predictive model with two additional components:
– the data;
– the evaluation system, or feedback, to analyze and monitor the results of the measures taken.

Prescriptive analysis can influence the company's activities and even the way decisions are made. It also has a very significant impact on companies in all sectors, by enabling them to improve their effectiveness and become more efficient. Prescriptive analysis is therefore used to quantify the effect of future decisions, in order to indicate possible outcomes before the decisions are actually made.

The autonomous car is the best example of the application of this analysis. Such a car analyses and assesses its environment and decides on the action to be taken based on the data: it can accelerate, slow down, change direction to avoid traffic, etc. It should be noted, though, that this car also relies on predictive and descriptive analysis in its data analysis process.
So, with one of the three analyses described above, or by combining two or three of them, a company can understand its data and generate relevant information at different levels of analysis, which can facilitate decision-making and inform the various action plans. Leveraging data to analyze trends and creating predictive models to identify potential challenges and opportunities in the near future both offer new ways to optimize processes and improve performance. The different types of analysis have been established by experts in the field in order to facilitate the interpretation of data.
However, the company must ensure it chooses the right analytical option in order to increase its ROI, reduce its costs, operationalize its activities and guarantee its success. The easiest way to do this is to look at the answers that each type can generate.

6.2.4. From descriptive analysis to prescriptive analysis: an example

As can be seen, moving from one type to another, the analysis is applied in a sequential way: descriptive, then predictive, then prescriptive. However, this does not mean that one type is better than another; on the contrary, these three types complement each other (Sedkaoui and Khelfaoui 2019).

What? Were the previous discussions not clear enough? Still can't tell the difference between these three types of analysis? Do you want to understand what these different types of analysis mean, and which type is most appropriate for which situation? You probably need some additional explanation. In that case, we will use a simple example, hoping that it will help you.

Imagine that you are a mobile application developer, and you want to get an overview of your business. First, you would start with a simple analysis to calculate the number of downloads of your applications and the profits you have made over the last three years, for example. You would also define certain cohorts of interest, such as the most downloaded applications, the least downloaded applications, etc., and calculate the profits of each. This is simple, because it is just an assessment of your activity, done in order to understand its behavior (its history). What you have just done is describe your activity based on the data you already have. This is descriptive analysis.

Now let's assume that you want to develop these applications to expand your business (create new applications or develop some of them further).
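The descriptive step just described can be sketched in a few lines of Python; all figures, app names and cohorts are invented for illustration:

```python
# Descriptive analysis sketch: summarizing three years of (hypothetical)
# app download and profit data with elementary calculations only.
from collections import defaultdict
from statistics import mean

# (app, year, downloads, profit) rows: invented figures for illustration
rows = [
    ("notes",    2017, 12000, 3000), ("notes",    2018, 15000, 4000),
    ("notes",    2019, 21000, 6500), ("calendar", 2017,  4000,  900),
    ("calendar", 2018,  3500,  700), ("calendar", 2019,  3000,  500),
]

downloads_by_app = defaultdict(int)
profit_by_app = defaultdict(int)
for app, year, downloads, profit in rows:
    downloads_by_app[app] += downloads
    profit_by_app[app] += profit

# The "cohorts of interest" from the example: best and worst performers
most_downloaded = max(downloads_by_app, key=downloads_by_app.get)
least_downloaded = min(downloads_by_app, key=downloads_by_app.get)
print("most downloaded:", most_downloaded)
print("least downloaded:", least_downloaded)
print("mean yearly profit:", mean(p for _, _, _, p in rows))
```

Nothing here predicts anything: sums, means and rankings simply describe the history, which is exactly the scope of descriptive analysis.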
The problem is that you cannot have accurate information about which applications will be the most popular in the future, or how much profit they will make, etc., because all you have is historical data about your business. What you can do here, for example, is take this historical data and create a model that will allow you to predict what will happen in a month, in six months, in a year or more. So, depending on the current and past situation, you will have a more advanced forecast (number of downloads, profits, etc.) of your activity. You have then moved on to the second level of analysis, or predictive analysis.
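A minimal sketch of this second level, assuming a simple linear trend is an adequate model (real forecasting would test richer models): fit the historical downloads and extrapolate one month ahead. The monthly figures are invented.

```python
# Predictive analysis sketch: fitting a linear trend to past download
# counts by ordinary least squares, then extrapolating one period ahead.
def fit_line(xs, ys):
    """Ordinary least squares fit for y = a*x + b; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

months = [1, 2, 3, 4, 5, 6]
downloads = [1000, 1200, 1350, 1500, 1700, 1850]  # hypothetical history

a, b = fit_line(months, downloads)
forecast_month_7 = a * 7 + b
print(f"trend: {a:.1f} downloads/month, forecast for month 7: {forecast_month_7:.0f}")
```

The forecast is a probability-weighted guess, not a certainty, which is precisely the caveat made above about the word "predict".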
Now, let’s imagine that you want to go further in your business by developing a new application for e-learning and increase its benefits, and that you want to know what you need to do in order to achieve it. Would you need to look for new customers (universities, etc.) (Plan A), stop developing the least downloaded applications (Plan B), develop relationships with the academic world or professionals who are best qualified to understand learning needs (Plan C) or launch an advertising campaign (Plan D)? To know which plan to choose, you would need to illustrate some forecasts. But maybe you’ve never done an advertising campaign and you’ve always worked with the same customers. In this case, you must use new data sources to calculate the strategic effects of actions A and D. You would also need to look for an optimal solution that combines actions A, B, C and D. This is where the third type of analysis comes in, which is called prescriptive (or optimal) analysis. To better clarify this example, we want to draw your attention to a very important point, which is related to these different plans (A, B, C and D) that interact with each other. This is the case, for example, with Plan D (launching an advertising campaign), which may not have the same impact when you decide to start Plan B. Because, at some point, the less downloaded applications may be of interest to some customers (universities, etc.) who may ask you to redesign them and adapt them to “e-learning”, which makes Action D irrelevant. However, there are solutions to this type of optimization problem, which comes from optimal control theory. This theory seeks a control solution in order to achieve an optimality criterion (Kirk 2012). Box 6.3. A solution for optimization problems
6.3. The process of Big Data analytics: from the data source to its analysis

The concepts behind Big Data analytics are not new. Companies have always sought to use descriptive, predictive and prescriptive approaches to optimize solutions. Similarly, for many years, researchers and academics have used different analytical techniques to study different phenomena. However, valuing the volumes of data available in real-time and in different forms, and developing intuitive, innovative ideas from them, requires a solid layer of highly advanced data analysis techniques. Nor is this all, because following a structured approach to data analysis is just as important.

But before talking about the workflow that will make your tasks easier, and describing the different analysis techniques that will allow you to understand the different
mechanisms of the analysis process, do you know exactly what data analysis is all about?

Data analysis is a process that starts by defining the objectives, or the questions you hope to answer. To achieve this, you need to collect all the data related to these questions, clean them and prepare them for exploration and interpretation, in order to take advantage of them and obtain useful information that can suggest conclusions and better guide your decision-making process. Davenport and Harris (2007) define it as:

The extensive use of data, statistical and quantitative analysis, explanatory and predictive models and evidence-based management to better guide decisions and actions.

Nowadays, the literature often links data analysis to the notion of Business Intelligence, due in particular to the increased processing capabilities of machines. This concept, popular since the 1990s (Chen et al. 2012), shows the importance of data analysis capabilities.

Data analysis is the process of inspecting, cleaning, transforming and modeling data to discover useful information, suggest conclusions and support decision-making. It focuses on knowledge discovery for descriptive and predictive purposes, to discover new ideas or to confirm existing ones. Data analysis is a key step in the knowledge extraction process, often known as knowledge discovery in databases (KDD), which involves the application of different algorithms to extract models (knowledge) from data.

Box 6.4. Knowledge discovery in databases (KDD)
It is true that analytics often gives off a "crystal ball" impression, capable of revealing the secrets behind each byte of data, but a considerable amount of work goes on behind the scenes. This process, which you may well find complex, consists of a series of distinct steps or phases. If you want to understand this analysis flow, you must follow certain steps: from a specific idea (an objective), to the formulation of good questions, to the preparation (collection, cleaning, etc.) of the data, to its mining and analysis, in order to create value.

In this context, to help you understand the process of data analysis and the power of analytics, we use the logic applied during the "Taylorism" period. At the
time, to simplify a given problem and deal with its complexity, you just needed to break it down into simpler sub-problems (Sedkaoui 2018b). We will follow the same logic to help you understand the process of Big Data analytics.

6.3.1. Definition of objectives and requirements

The first step in any process is to define the objectives that need to be achieved; in other words, to ask the right questions. The aim here is to make sure that you have clearly defined the what, the how and the why (in other words, the context of your project), so as to settle on the best action plan. Before you enter the data analysis phase, you need to understand this context in order to define the main problems and identify needs. At the end of this small exploration step, you will have a global view of what you want to accomplish.

This step must take place before any data is collected, because once you have the necessary data, you will examine it to guide your decision-making strategy. At this point, you will need a baseline or indicator to determine whether the project is achieving its objectives, and this reference should be defined at the beginning of the process. It means defining the target while considering the different constraints and the potential solutions that will make it possible to reach it.

Let's return to the previous example: if you want to develop an application, it will be necessary to define the notion of "interest" that lies behind the creation of such an application, and then produce the work plan in accordance with the requirements of that interest. It is important here to understand the motivation of your project before undertaking the actual analysis tasks. This step will therefore allow you to explore all possible avenues, in order to identify the different variables that directly or indirectly affect the phenomenon that interests you.
This will help you understand the following:
– the objective (target);
– the context (opportunities and challenges);
– the data sources;
– the necessary analytical techniques;
– the relevant technologies;
– the cost;
– the necessary time.

6.3.2. Data collection

Before you begin the data collection step, you must understand and identify the data that could be useful for your business. It is not just about the quantity of data, or Big Data, but about the value generated by analyzing these data. In the era of Big Data, the amount of data produced will continue to grow. More data will arrive from various sources, in real-time and in different forms, and will be captured and stored in many formats, ranging from simple files and database tables to emails, photos, videos, etc. This means that you need two types of data (Sedkaoui 2018b):
– the data that is already stored internally, or the data you have. This may include databases, various digital files, e-mails, photos, archives, etc. You will search these sources for the data that is relevant to your project;
– the data that comes from outside, which you will couple with your internal data. External data refers to all data that is not generated by your activity: social network data, videos, tweets, geolocation data, etc. Most of these data are unstructured, which generally makes them difficult to use, but they are useful for enriching internal data. There is also data from other companies, or even from governments and organizations, which make their data accessible to all and share it for reuse. This type of data is known as Open Data; it can be of excellent quality, depending on the party managing it.

In this step, you will therefore look for the data necessary to achieve the predefined objectives.
This will be difficult, because during this phase you will face a number of data issues, such as:
– the volume, variety and velocity, already detailed in Chapter 5 of this book;
– the complexity of data collected from different sources and in different formats;
– problems with missing data or outliers;
– the need to ensure the consistency and quality of the data, etc.
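As a minimal sketch of coupling internal and external sources, assume an internal CSV export and an external JSON open-data feed that share an identifier; both sources and all field names are invented here:

```python
# Collection sketch: enriching internal records with external (open) data
# keyed on a shared identifier. Sources and fields are hypothetical.
import csv
import io
import json

# Internal data, e.g. exported from a company database as CSV
internal_csv = io.StringIO("app_id,downloads\n1,48000\n2,10500\n")
# External data, e.g. an open data feed delivered as JSON
external_json = '[{"app_id": 1, "category": "productivity"},' \
                ' {"app_id": 2, "category": "utilities"}]'

internal = {int(r["app_id"]): r for r in csv.DictReader(internal_csv)}
for record in json.loads(external_json):
    internal[record["app_id"]]["category"] = record["category"]

print(internal[1])  # internal downloads enriched with the external category
```

Even in this toy case the heterogeneity issue from the list above appears: the two sources use different formats and different value types, and must be reconciled before analysis.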
You should therefore pay particular attention to this phase, as the reliability of the data must be ensured before starting an analysis. According to Soldatos (2017), Big Data collection has several particularities compared to the consolidation of traditional data from distributed sources, such as the need to process heterogeneous data. The data collected in this step will probably be a very important source of information, on which the relevance of the model you build will depend. What you need to do now is prepare this source for the modeling phase. This is extremely important, because the performance of your model depends largely on this step.

6.3.3. Data preparation

Once you have collected the necessary data, proceed to the preparation step. This stage gathers the activities related to the construction of the data to be analyzed, starting, of course, from the sources (raw data). During this step, the data will be:
– classified according to selected criteria;
– cleaned;
– coded, to make them automatically compatible with the tools that will be used.

It is a real challenge to take advantage of the amounts of data you have been able to collect. You will find that some types of data, such as data from relational databases, are easy to analyze: this is structured data. But if you are working with unstructured data, you need to prepare it for analysis.

Next, the data must be cleaned. Data cleaning is a sub-process of the data analysis process. It includes the transformation of so-called unstructured data into semi-structured or fully structured data, in order to make the data-set more consistent. Data cleaning covers several elements, which are elaborated on in the following three points.

6.3.3.1. Missing values

Handling missing values removes incomplete and inconsistent observations from a data-set, which will improve the model's performance.
To handle a missing value, you have two choices:
– simply delete the observation;
– replace the value with a default value.
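Both options can be sketched on a small, invented customer table; using the mean as the default replacement value is one assumption among others:

```python
# Missing-value handling sketch on an invented customer table.
# None marks a missing value; the mean is used as the default replacement.
customers = [
    {"id": 1, "height_cm": 172, "city": "Paris"},
    {"id": 2, "height_cm": None, "city": "Lyon"},
    {"id": 3, "height_cm": 180, "city": None},
]

# Option 1: delete every observation that has a missing value
complete = [c for c in customers if None not in c.values()]

# Option 2: replace missing heights with a default (here, the mean height)
known = [c["height_cm"] for c in customers if c["height_cm"] is not None]
mean_height = sum(known) / len(known)
repaired = [dict(c, height_cm=c["height_cm"] or mean_height)
            for c in customers]

print(len(complete), repaired[1]["height_cm"])  # 1 176.0
```

Option 1 keeps only one of the three rows here, which shows why blanket deletion can be costly; choosing a sensible default requires the business context discussed below.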
Missing data can be deleted when the absence of certain values prevents any tangible observation. In this case, you simply delete the observation, because it does not bring any real value. In other cases, you can just replace the missing value. For example, if you work on the customer database of a chain of stores in France, you will have several variables, such as customer height, gender, date of birth, etc. If, somewhere in this database, an observation of the "height" variable is missing, you can replace it with the average height of the French population. This may sound simple, but it is essential to understand the context of this chain of stores in order to assign values that are logical and appropriate to the phenomenon being analyzed. This will help you avoid a bias that would affect your model. Such values are not necessarily false, but they should preferably be treated separately, as some modeling techniques cannot manage them.

6.3.3.2. Outliers

An outlier is an observation that seems to be far removed from the other observations; in other words, an observation in the database that follows a completely different logic from the others. For example, let's look at the database of this chain of stores, where the customers' salaries vary between 2,500 euros and 5,000 euros per month. If you observe that one customer's salary reaches 15,000 euros per month, you can consider this value an outlier: this customer may be a celebrity (a football player, an actor, etc.). This type of value is an observation that differs from all the others. It is essential to remove such outliers so as not to bias the model obtained, because the goal is to build a model that captures the typical situation, not one that adapts to every extreme value.

6.3.3.3. Errors

In the process of data analysis, it is very important to correct errors. But what types of errors are they?
When you work with large databases, you often have to deal with two classes of error:
– data entry errors: blank spacing, upper-case letters, "bugs" (caused by machines), etc.;
– errors that reveal inconsistencies between different data.
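Both classes can often be corrected programmatically. A sketch, in which the normalization table and the exchange rate are illustrative assumptions:

```python
# Error-correction sketch for both classes: normalizing entry variants
# ("Male", " m ", "Man") and converting mixed currencies to a single unit.
# The mapping table and the exchange rate are invented assumptions.
GENDER_MAP = {"male": "Male", "m": "Male", "man": "Male",
              "female": "Female", "f": "Female", "woman": "Female"}
USD_PER_EUR = 1.10  # assumed fixed rate for the example

def normalize_gender(raw: str) -> str:
    """Fix class 1 errors: trailing spaces, casing, entry variants."""
    return GENDER_MAP.get(raw.strip().lower(), "Unknown")

def to_usd(amount: float, currency: str) -> float:
    """Fix class 2 errors: bring all salaries to one currency."""
    return round(amount * USD_PER_EUR, 2) if currency == "EUR" else amount

print(normalize_gender(" M "))  # Male
print(to_usd(3000, "EUR"))      # 3300.0
```

The same normalize-then-convert pattern applies to the "per hour" versus "per minute" case: map the variants to one label, then convert the values to one unit.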
The first class of error arises where data entry requires human intervention: the person entering the data may mistype certain values or leave blank spaces. For example, suppose you are entering the answers to a customer satisfaction survey, and one of the questions proposes two options: "Female" and "Male". A respondent's answer might be entered as "Man" instead of "Male", with a lower-case first letter, or with a trailing blank space. This type of error can be corrected by using Excel, for example: a simple sort can show you the different cases where these values have been entered incorrectly, and further manual review can correct them.

The second class of error can occur when, for example, you enter "Male" in some places and "M" in others, even though both mean the same thing, or when you indicate the salary of some employees in dollars ($) and others in euros (€). The latter can happen especially if you work with data from different countries; in that case, a simple conversion to a single currency corrects the values. The same applies to data presented "per hour" in one place and "per minute" in another: you just need to use the same unit, either hours or minutes, throughout.

You now understand the importance of cleaning, processing and preparing data from different sources. It is time to move on to the next step in the process, which will allow you to understand the different relationships and correlations that exist between the different variables in your database.

6.3.4. Exploration and interpretation

Once you have collected the right data to answer the question defined in the first step, and cleaned and prepared it well, it is time for an exploratory analysis, in order to interpret the data and make sense of it. This step will help you understand the composition of your data, its distribution, and the correlations (similarities and differences) that can emerge from it. When you manipulate data, you may find exactly the data you need, but it is more than likely that you will have to revise your initial question or collect more data. In any case, this initial analysis of trends, correlations, variations and outliers
helps you focus your analysis so as to better answer your question, as well as any objections that others may raise.

One of the simplest ways to explore your data is to use the basic parameters and indicators of descriptive statistics: the mean, standard deviation, frequencies, etc. These give a concise picture of the data and determine its main characteristics, thus allowing you to understand the overall trend. To manipulate the data, you can use, for example, a pivot table in Excel. Such a table allows you to sort and filter the data according to different variables, perform basic calculations and search for correlations. You can also present the results in graphical form (histograms, box plots, scatter plots, 3D diagrams, etc.), which serve as tools to visualize the data.

Data visualization occurs when information is presented in a visual form. Its purpose is to communicate information in an understandable way, which makes the data easy to compare. Cross-referencing data with one another during visualization allows relationships that are less obvious at first glance to be observed. Thanks to the different visualization techniques, information is easier to grasp when presented in the form of images or graphs.

It is therefore essential to take an exploratory view during this step: verify the data, and take a step back to correct anomalies if necessary. The objective here is to understand your data and ensure that it is properly cleaned and ready for analysis. In practice, the techniques used in this step are not limited to visualization techniques or simple basic statistics; other techniques and methods, such as modeling, clustering, classification, etc., may also be useful for exploratory analysis.
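As a sketch of this exploratory step, the basic indicators and a correlation can be computed with the standard library alone; the monthly figures are invented:

```python
# Exploration sketch: basic descriptive indicators and a Pearson
# correlation between two (invented) monthly series.
from statistics import mean, stdev

downloads = [1000, 1200, 1350, 1500, 1700, 1850]
profit    = [250, 310, 330, 380, 420, 470]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    mx, my = mean(xs), mean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs)
           * sum((y - my) ** 2 for y in ys)) ** 0.5
    return num / den

print("mean downloads:", round(mean(downloads), 1))
print("stdev downloads:", round(stdev(downloads), 1))
print("downloads/profit correlation:", round(pearson(downloads, profit), 3))
```

A correlation close to +1 here suggests that profit moves with downloads, the kind of overall trend this step is meant to surface before any modeling.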
However, the statistical and technical context is not enough; you must have a good understanding of the business, and be able to recognize whether the results of the models are meaningful and relevant.

6.3.5. Modeling

Now that you have completed the exploration phase and understood the data you have prepared, it is time to move on to the next phase: model creation. This model will be based on a data-set that is representative of the phenomenon (problem) you are trying to model.
This phase is much more specific than the exploratory phase, because you now know what you are looking for and what you want to achieve. The construction of the model therefore depends on the available data and the techniques you wish to use. When you obtain the model, you will also need to consider the questions below:
– does the model really answer the initial questions (objectives), and how?
– can I easily implement it to explain the phenomenon under study?
– are there any constraints or considerations that have not been taken into account?

If your model can answer these kinds of questions, then you probably have a model that can reach productive conclusions. Before making the model operational, you must assess its quality; in other words, its ability to accurately represent your initial problem. This will ensure that it meets the objectives formulated at the beginning of the process. The assessment also informs the decision to deploy the model or, if necessary, to improve it. At this stage, the robustness and accuracy of the models obtained are tested.

6.3.5.1. Testing the model's performance

After obtaining the model, the question of its performance arises: to what extent does the model really explain the phenomenon? To answer this, you will need to review the performance of the model and determine whether it meets all the requirements of the project. There is a set of techniques that you can use to test a model's performance. These tests will allow you to quantify the behavior of your model in order to understand how it will react. We also invite you to present the problem in a different way and to test other analytical techniques before committing to a specific result.

6.3.5.2. Model optimization

Finding a consistent, well-established predictive model is an iterative and empirical process.
To do this, you will alternate between the modeling phase of the predictive system and the performance measurement phase. During these iterations, you will refine your hypothesis on the data, as well as on the characteristics that come into play in the prediction.
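This alternation between modeling and performance measurement can be sketched in a few lines of Python. The sketch below is purely illustrative: it compares two hypothetical candidate models (a naive mean predictor and a hand-rolled least-squares line) on held-out data and keeps the one that generalizes best, which is the essence of the loop described above.

```python
import random

def fit_mean(train):
    # "Model" 1: always predict the training mean (a naive baseline).
    m = sum(y for _, y in train) / len(train)
    return lambda x: m

def fit_linear(train):
    # "Model" 2: ordinary least squares for y = a*x + b, computed by hand.
    n = len(train)
    sx = sum(x for x, _ in train)
    sy = sum(y for _, y in train)
    sxx = sum(x * x for x, _ in train)
    sxy = sum(x * y for x, y in train)
    a = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - a * sx) / n
    return lambda x: a * x + b

def rmse(model, data):
    # performance measure, computed on data the model has never seen
    return (sum((model(x) - y) ** 2 for x, y in data) / len(data)) ** 0.5

random.seed(0)
points = [(x, 2 * x + 1 + random.gauss(0, 0.5)) for x in range(50)]
random.shuffle(points)
train, test = points[:35], points[35:]

# the modeling / performance-measurement loop: score each candidate on
# the held-out set and keep the best one
scores = {name: rmse(fit(train), test)
          for name, fit in {"mean": fit_mean, "linear": fit_linear}.items()}
best = min(scores, key=scores.get)
print(best)
```

In a real project the candidates would be richer (different algorithms, features or hyperparameters), but the principle is the same: every refinement of the hypothesis is validated against held-out data before being adopted.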
During this phase, you may need to look for more data if you realize that the data you have used is insufficient. In this case, you need to go back to the data preparation and exploratory analysis stages in order to integrate this new data. Once this part is completed, you can take action.

6.3.6. Deployment

By following the previous steps of the data analysis process, you are making better decisions for your project. At this point, we can say that the choices have been guided and oriented by the data you have collected, prepared, explored and analyzed in a robust way. By performing the various tests, your model becomes more accurate and relevant. This means that you will make informed decisions to manage your business effectively.

The emphasis put on the end use is therefore one of the key factors in the data analysis process. With clean data in place and a good understanding of its content, you will be able to create one or more models in order to make better forecasts, rank your priorities or better understand the phenomenon you modeled.

Once acquired, your model will be ready to be presented. This involves planning the deployment of your model, its monitoring and its maintenance. The deployment can range from the simple generation of a report describing the knowledge acquired, to the implementation of an application that uses the model obtained to predict unknown values of an element of interest (Sedkaoui 2018b).

At this point, you may think you have reached the end of the process, but this is not the case, because it is during this stage that your general skills will be most useful. If you have done things right, you now have an operational model and stakeholders ready to use it. This last step involves using the results of your data analysis process to determine the best course of action.
This process may seem a little abstract at the moment, but we will deal with all these points in detail in the last part of this book. You will see a great commonality between all these steps.

6.4. Conclusion

Once you have collected the data you need, it is time to move on to the analysis. But this analysis requires a structured approach in order to better generate value. If you have done things right, you now have an operational model that you
can use. We can therefore conclude this chapter here and move on to the relationship between Big Data analytics and the sharing economy.

TO REMEMBER.– In this chapter, you have had the opportunity to learn:
– the different types of analysis you can use, namely:
- descriptive analysis (which does exactly what its name says, “describe”), which is used to provide an overview of the past and answer the question: “What happened?”;
- predictive analysis, which uses statistical models and forecasting techniques to understand the future and thus answer the question: “What could happen?”;
- prescriptive analysis, which uses optimization and simulation algorithms to advise on potential outcomes and answer questions such as: “What now?” or “What needs to be done?”;
– the different steps of the data analysis process:
- definition of objectives and requirements: defining what, why and how;
- data collection: search for and capture of all necessary data;
- data preparation: data verification and correction (cleaning, etc.);
- exploration and interpretation: analysis of data using statistical and visual techniques in order to better understand it;
- modeling: the use of advanced techniques (Machine Learning algorithms) for modeling according to the initial objective;
- deployment: moving from data to knowledge (using the model).
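As a closing illustration, the steps summarized above can be sketched as a toy pipeline. Everything here is invented for illustration (the records, the cleaning rule, and the naive "model" that flags values far from the mean); the point is only the shape of the process: collect, prepare, explore, model, deploy.

```python
# Hypothetical mini-pipeline mirroring the steps of the data analysis process.

def collect():
    # data collection: raw records, some malformed
    return ["12", "15", "n/a", "9", "18", ""]

def prepare(raw):
    # data preparation: cleaning (drop non-numeric values)
    return [int(r) for r in raw if r.isdigit()]

def explore(values):
    # exploration: simple descriptive statistics
    return {"n": len(values), "mean": sum(values) / len(values)}

def model(values, stats):
    # modeling: flag values far from the mean (a deliberately naive "model")
    return [v for v in values if abs(v - stats["mean"]) > 3]

def deploy(outliers):
    # deployment: turn the model's output into an actionable report
    return f"{len(outliers)} value(s) need attention: {outliers}"

raw = collect()
clean = prepare(raw)
stats = explore(clean)
flagged = model(clean, stats)
report = deploy(flagged)
print(report)
```

Each function stands for one step of the process; in a real project each would be far richer, but the flow from raw data to a deployable result is the same.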
7 Data and Platforms in the Sharing Context
Where ideas are lacking, one word always arrives on time. Johann Wolfgang von Goethe
7.1. Introduction

The advent of Internet applications and mobile technologies, changes in general attitudes, and the greater attention paid to sustainable consumption in recent years have led to a new context characterized by sharing. This context was widely discussed in Part 1 of this book to foster understanding of the characteristics of an economy based on sharing, its operation, and its promises.

Several sharing economy companies around the world have reacted positively to the trends in this economy and have affirmed their position within their business sector. Airbnb, for example, the leading accommodation sharing platform, has more than three million listings, and its hosts receive more than 150 million guests worldwide. Then there’s Uber, which has been revolutionizing the transportation industry since its creation in 2009 and currently operates in more than 50 countries and 250 cities worldwide. Not to mention, of course, BlaBlaCar, Lyft, Drivy, Quora, TaskRabbit, WeWork, Djump, Deliveroo, Haxi, Didi, CouchSurfing, Zipcar, Bag Borrow or Steal, Poshmark, etc., which rely on the intensive use of digital platforms and owe their success to the sharing economy.
Sharing Economy and Big Data Analytics, First Edition. Soraya Sedkaoui and Mounia Khelfaoui. © ISTE Ltd 2020. Published by ISTE Ltd and John Wiley & Sons, Inc.
The sharing economy has become popular and is based on new business models in which access to goods and services can be easily shared. The term itself describes a phenomenon based on the sharing, access, or provision of goods, services, labor, and underutilized financial resources, involving sources of demand, resource and service providers, and digital platforms. Bike sharing, carpooling, and the sharing of tools, beds, food, etc. have grown. Overall, in just a few years, this economy has greatly changed the industrial landscape and the way we do business, affecting the value creation process.

It’s important to keep in mind that, without data and the application of various analytical techniques, neither Uber, Airbnb, BlaBlaCar, Lyft, nor any other company born in this economy would know how to create value and boost its business model. Obviously, the success of every innovative business model in the sharing economy is based on a digital platform that allows different participants to share goods and services. But, in reality, these platforms could not function or ensure their success without effective techniques for data analysis. So, there are two essential ingredients for successful business models in the sharing economy: data and analysis.

Indeed, the adoption and successful implementation of this paradigm, based on sharing, remains a challenge. To address this, we argue in this chapter for the use of Big Data and for harnessing the potential of Analytics as a key element and a fundamental basis for companies in the sharing economy. Moreover, many studies suggest that Big Data can open up new opportunities and generate operational and financial value (Ohlhorst 2013; Morabito 2015; Henke et al. 2016; Foster et al. 2017; Sedkaoui 2018a).
Big Data opportunities seem endless, so before showing how this phenomenon can be an ally and an accelerating factor for the sharing economy, we will present a selection of Big Data applications through case studies from the current business world. These examples show the interest that large companies are paying to the analysis of large volumes of data and how it has enabled them to implement their strategies and enhance their competitiveness. Large companies such as Facebook, Amazon, and Netflix, leaders in their fields, have proven the power of data and have built their strategies on its analysis.

This chapter will also provide an opportunity to understand the importance of the “data/platform” duo that has shaped the ecosystem of the sharing economy. But first, let’s take a close look at how the web giants have taken advantage of Big Data.
7.2. Pioneers in Big Data

As we stated in the previous chapters, having more data is practically synonymous with more value and more precise targeting that takes into account the operationalization of the company’s various business processes. Faced with the 3 Vs of Big Data, as previously described, many companies are engaging in this context to generate value. The goal is to be successful by applying advanced analytics to large amounts of data in order to uncover hidden patterns and correlations that are barely noticeable.

Many still think that Big Data and Analytics are only abstract concepts. But the strength of Amazon, like that of Netflix, Facebook, Google, Walmart, and many others, lies in the analysis of the data they collect to understand the behavior of their customers and to improve the decision-making process. Born in the digital era of which they were the primary architects, these companies have clearly benefited from the extreme digitization of their respective activities. For years, these large companies have been investing in the construction of data centers and have also deployed solutions for data analysis.

A “data center” is a facility consisting of networked computers and storage systems that businesses and other organizations use to organize, process, store, and distribute large amounts of data. A company usually depends on many applications, services, and data contained in a data center, making it a focal point and a vital asset for daily operations.

Box 7.1. Datacenter1
Want to know how these companies are able to exploit large quantities of data to operationalize their various activities and to assist their decision-making processes, with some going so far as to offer personalized products and services? We’ll find out in the examples discussed in the following sections.

7.2.1. Big Data on Walmart’s shelves

When we talk about Big Data, it’s hard not to think of the example (diapers and beer) often used to describe the importance of analyzing massive amounts of data. Of course, this is about Walmart and its experience in data analysis in the retail sector.

1 Source: https://www.lebigdata.fr/definition-data-center-centre-donnees.
With petabytes of mostly unstructured and constantly generated data, Walmart was able, by exploiting the potential of Analytics, to optimize its distribution process. All Walmart decision-making processes rely on data extraction and on the analysis of customer behavior and product inventories. Whether internal or external, the analysis of petabytes of the company’s data has optimized its various operations and identified all sorts of correlations in order to understand customers and anticipate their needs.

Let’s return to the example of diapers and beer. After learning that young men often buy diapers and beer between 5 pm and 7 pm, a simple reorganization of the stores led to an explosion in the sales of these products (Sedkaoui 2018b). This real-time correlation system, operating with a constant flow of data, enabled Walmart to increase sales by simply placing these two products closer to each other.

Another approach that has strengthened Walmart’s decision-making process is the use of data from social networks like Facebook, Twitter, etc. Many of its decisions rely on the analysis of this data to generate useful information. Walmart also exploits large quantities of data to support its recommendation system. For this, the company developed an application, Shopycat, that provides links to products so that customers can access them easily, and that can recommend gifts for friends based on data extracted from their Facebook profiles. In this context, Big Data analytics is an opportunity for retail to seize in order to provide an increasingly rich and innovative customer experience.

7.2.2. The Big Data behind Netflix’s success story

Why Netflix? Because this company, founded in 1997, analyzes everything we watch and interprets the results to derive trends for future productions. It all began with House of Cards, a series first broadcast by the BBC in 1990.
Wondering how Netflix has grown so fast? Simply by analyzing the tastes of the subscribers who liked the first version of this series. Netflix found that these subscribers also watched other films starring Kevin Spacey and/or directed by David Fincher. Netflix relies heavily on a thorough knowledge of its subscribers as a constant source of service improvement. This knowledge enables it to offer real-time personalized recommendations, a far cry from just a few years earlier, when it was necessary to go out and buy a DVD to watch.
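The co-occurrence reasoning behind this finding (viewers of one title disproportionately watching another) can be sketched in a few lines of Python. The viewing histories below are entirely invented for illustration; the measure shown, lift, is a standard association metric, not Netflix's actual algorithm.

```python
# Invented viewing histories (one set of titles per subscriber).
histories = [
    {"House of Cards (BBC)", "Se7en", "The Social Network"},
    {"House of Cards (BBC)", "Se7en"},
    {"The Crown", "Bridgerton"},
    {"House of Cards (BBC)", "The Social Network"},
    {"Bridgerton", "Se7en"},
]

def support(*titles):
    # fraction of subscribers who watched all the given titles
    return sum(1 for h in histories if set(titles) <= h) / len(histories)

def lift(a, b):
    # > 1: a and b are watched together more often than chance would predict
    return support(a, b) / (support(a) * support(b))

# Do viewers of the BBC series disproportionately watch a Kevin Spacey film?
print(lift("House of Cards (BBC)", "Se7en"))
```

A lift above 1 is exactly the kind of signal that suggested the American remake would find an audience among viewers of the original.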
Netflix has tapped into the potential of data analysis in different ways, and Table 7.1 summarizes some of its Big Data applications.

Application: Design testing
How? Any change to the platform, however small, is first tested by users
Methods: A/B testing

Application: Guiding creative choices
How? Creating algorithms to discover many things about users (as in the example of House of Cards)
Methods: Forecasting, modeling

Application: Personalized content recommendations
How? Using customer knowledge to personalize and optimize services
Methods: Modeling, classification, clustering, etc.

Table 7.1. Various Big Data applications at Netflix
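The third application in Table 7.1, personalized recommendation, can be illustrated with a minimal user-based collaborative filter. The ratings below are invented, and this sketch (cosine similarity between users, then similarity-weighted scores for unseen titles) is one classic technique among many, not Netflix's production system.

```python
import math

# Toy ratings (user -> {title: rating}); purely illustrative data.
ratings = {
    "ana":  {"House of Cards": 5, "Se7en": 4, "The Crown": 1},
    "bob":  {"House of Cards": 4, "Se7en": 5},
    "carl": {"The Crown": 5, "Bridgerton": 4},
}

def cosine(u, v):
    # cosine similarity between two users' rating vectors
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[t] * v[t] for t in common)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv)

def recommend(user):
    # score unseen titles by the ratings of similar users
    scores = {}
    for other, theirs in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], theirs)
        for title, r in theirs.items():
            if title not in ratings[user]:
                scores[title] = scores.get(title, 0.0) + sim * r
    return max(scores, key=scores.get) if scores else None

print(recommend("bob"))
```

Because "bob" rates very much like "ana" and not at all like "carl", the titles "ana" has seen are pushed to the top of his recommendations.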
Netflix has therefore shaken up personalization and recommendation systems, like Facebook, Amazon, and others. Each click on “watch” feeds the data analysis process, which maps the diversity of a subscriber’s choices in order to later recommend movies or series. Yes! To access Netflix content, the subscriber must have a personal account that contains personal information, and this data is used by the Netflix system. In addition, each subscriber leaves behind tracks that the system can follow. Everyone watches movies or series based on their taste, which is identifiable. But people also watch based on their availability, or on the date and time, and they will watch a certain number of times. Netflix collects this data and combines it with sociological characteristics and dozens of other criteria. Once collected, the data is analyzed and cross-referenced across groups of users (subscribers) with different characteristics in order to recommend movies or series to them.

7.2.3. The Amazon version of Big Data

If we only visit an e-commerce site for a specific product, our interests cannot be clearly defined. Accordingly, data from news sites and other social networks is analyzed to better understand what we are interested in, based on our information and searches on the web, in order to subsequently recommend articles tailored to us. That’s what Amazon does in order to suggest products to us. Like Walmart, Amazon uses data from Facebook, Twitter, Google, etc. to refine its targeting and to better understand our interests. Amazon is also another example that illustrates the importance and potential of analyzing large amounts of data in real-time.
Thanks to the extremely advanced data analysis algorithms that Amazon has developed to anticipate our evolving needs, the company can suggest products that fit those needs by analyzing the traces we left during our last online session. In business for more than 20 years, Jeff Bezos’ company has built its innovations on the exploitation and analysis of data, and these innovations are still in progress, with Amazon Prime Now, launched in June 2016, and Amazon Pantry, launched in March 2017.

Amazon Web Services (AWS) is one example. AWS aims to leverage Amazon’s experience in advanced data analysis, artificial intelligence, etc. to simplify the construction of models and to accelerate learning in order to make Machine Learning accessible. Amazon Echo is a step in this direction: now no one has to leave home to shop, because a simple conversation with the Alexa virtual assistant will do the job. As the foundation of the Echo intelligent home assistant, Alexa is tending to become indispensable. This assistant is integrated directly into the functions of washing machines, refrigerators, vacuum cleaners, TVs, etc. These devices respond to more or less sophisticated voice commands by communicating directly with the Alexa assistant. The objective of such an assistant is to facilitate the use of these devices: starting or stopping a wash cycle, checking and setting the refrigerator temperature, starting a recording on the connected TV, etc. With this assistant, the customer will certainly be satisfied, since Amazon’s goal is symbolized by its logo: a smile.

7.2.4. Big Data and social networks: the case of Facebook

Have you ever shared videos that Facebook has suggested to you? Do you know what we’re talking about? We’re referring to the “flashback”: the short preview of your posts, “likes”, pictures, etc.
like those you may have received to celebrate your years on Facebook, or when Facebook suggests celebrating the anniversary of when you “became friends” with someone. If you have ever had these experiences, you have participated in one of the Big Data applications used by this social network. These examples, which retrace your activity since you registered on the network, illustrate one of the ways in which Facebook leverages the power of data. Facebook’s data analysis is not limited to data generated from your News Feed or your sharing activities, “likes”, etc. Rather, its data collection can
incorporate everything from your mailing address to the battery level of your smartphone. Data is transferred daily from our smartphones and other connected objects. Hard to imagine? And yet Facebook can even follow you across the web using tracking cookies.

In general, cookies are small text files, identified by ID tags and stored in the browser’s directory or program data subfolders on each computer. They are created when a person uses a browser to visit a website which, in turn, uses cookies to help pick up where it left off, and can store recorded logins, theme selections, preferences, and other customization features.

Box 7.2. Cookies
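What a cookie looks like at the HTTP level can be sketched with Python's standard library. The cookie names and values below are invented; the sketch only shows the two halves of the mechanism described in Box 7.2: the server setting a cookie, and the browser sending it back on later requests.

```python
from http.cookies import SimpleCookie

# What a server sends to the browser on a first visit (a Set-Cookie header):
c = SimpleCookie()
c["session_id"] = "abc123"           # invented identifier
c["session_id"]["path"] = "/"
c["session_id"]["httponly"] = True   # hide the cookie from page scripts
print(c.output())

# What the browser sends back on every later request to that site
# (a Cookie header), which is how the site "picks up where it left off":
incoming = SimpleCookie("session_id=abc123; theme=dark")
print(incoming["theme"].value)
```

Third-party tracking cookies work on the same principle; the difference is simply that the cookie is set by a domain embedded in many different sites, which lets that domain recognize the same browser across all of them.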
So, if you log into your Facebook account and browse other sites (music, online sales, etc.) at the same time, know that Facebook follows you and knows which sites you’ve visited. Using this data, Facebook shows specific ads that may interest you. Facebook can also use the photos you’ve shared or liked to track you on the Internet and on other Facebook profiles: using facial recognition and image processing, Facebook analyzes these shared photos to track you. This is the Facebook version of Big Brother is watching you, the famous sentence from George Orwell’s 1984, published in 1949.

You’ll understand that using the example of Facebook should not be surprising, since this social network is considered one of the largest suppliers of Big Data, to which it owes part of its success. So even if the largest social network is essentially free for users, Facebook profits by analyzing its users’ data.

7.2.5. IBM and data analysis in the health sector

As in all areas, the health sector as a whole is gradually being overwhelmed by the 3 Vs phenomenon. Through the collection and analysis of large amounts of data, the use of different techniques for prevention, treatment, diagnosis, and monitoring of patients is changing rapidly (Sedkaoui 2018a). In this sector, the analysis of large data-sets can also extract relevant information from the medical history of each patient. Big Data analytics facilitates the collection
and analysis of both structured and unstructured data in real-time from a growing number of sources, including medical records, electronic surveillance systems, and others. To this end, IBM has developed a program, IBM Watson, that is able to store, understand, cross-reference, and exploit medical data to diagnose and analyze patients’ health. Combing through newspaper articles, tweets, and blog posts, IBM Watson seems able to provide powerful features to the health sector. As a result, IBM was able to use the notes taken by doctors during consultations to diagnose heart failure, and to develop an algorithm that summarizes text using a technique called Natural Language Processing (NLP). Like a cardiologist, a computer can now read the doctor’s notes and determine whether a patient has heart failure (Sedkaoui 2018a). Watson can also make custom recommendations by reading the patient’s genome. Watson is positioned as a health cloud, aiming to create a real-time ecosystem for cross-referencing (anonymized) patient data.

Impressive, isn’t it? How have data analysis algorithms allowed Amazon and the other companies described above to achieve this success? After all, Big Data is just a large data-set from various sources, in different formats, which can be processed and analyzed appropriately. That said, it’s easy to see that, since its creation in 1994, Amazon has built a culture based on data and its analysis. The key factor is knowing how this data can be correlated and formatted to make sense of it through analysis.

Emerging companies in the sharing economy use digital platforms to connect people around the world and also to collect and analyze data. In what follows, we will see how Big Data and data analysis capabilities helped create the context of this economy and boost its business models.
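To make the Watson example less abstract, the simplest family of text summarization techniques, frequency-based extractive summarization, can be sketched in a few lines. The clinical note below is invented, and this is illustrative only, not IBM's algorithm: sentences are scored by the frequency of their informative words, and the top-scoring sentence is kept.

```python
import re
from collections import Counter

# A tiny, illustrative stop-word list.
STOP = {"the", "a", "of", "and", "to", "is", "in", "that", "it", "with"}

def summarize(text, n=1):
    # split into sentences, count informative words, keep the n best sentences
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    words = re.findall(r"[a-z']+", text.lower())
    freq = Counter(w for w in words if w not in STOP)
    def score(s):
        return sum(freq[w] for w in re.findall(r"[a-z']+", s.lower()))
    return " ".join(sorted(sentences, key=score, reverse=True)[:n])

# Invented consultation note, for illustration only.
note = ("Patient reports shortness of breath. "
        "Echo shows reduced ejection fraction and fluid retention. "
        "Findings consistent with heart failure; heart rate elevated.")
print(summarize(note))
```

Real clinical NLP involves far more (negation handling, medical ontologies, context), but the core idea of extracting the most information-dense sentences is the same.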
It is this question, which you may have been asking yourself since page one, that this book seeks to address through its various chapters: it began by immersing you in the context of the sharing economy, showed you the power of Big Data analytics, and now intends to help you understand the link between the two.

7.3. Data, essential for sharing

The various sharing economy practices and emerging business models have led us to the need for better cooperation, but also for better data sharing between
platform participants and users. The different models adopted by sharing economy companies are made possible by analyzing the large amounts of data they collect from their users via digital platforms. The success of many data-driven models in the sharing economy confirms the effectiveness of Big Data analytics. This is the case for Uber, Airbnb, Lyft, BlaBlaCar, etc. These companies have understood that each byte of data contains messages and underlying secrets that can be revealed. They have developed innovative approaches to data collection and analysis, and these methods are largely responsible for their success.

This leads us to say that Big Data is not just behind the success stories of large companies, but also behind the success of many experiments in the sharing economy. A significant number of startups and entrepreneurs are entering this area, striving to follow the data revolution and, why not, to launch their own business model. They have found that it is possible to use Big Data analytics not only to differentiate their business models, but also to innovate.

Have you noticed that the difference of which Cukier spoke (Chapter 5) is found here? Yes! He asserted that “more data” leads us to more innovative ideas (“More is new”), that good results are likely to help guide the decision-making process (“More is better”), and that they finally make it possible to differentiate and stand out from others (“More is different”).

The secret lies in the ability to continuously extract value from the data warehouses that are produced in various forms and from many sources. In other words, the key to unlocking the secrets behind each byte of data is to create different ways of doing so. This involves:
– exploring data from the sharing economy’s different platforms and practices;
– developing relevant means and methods for extracting the knowledge that can lead to successful strategies.
In other words, we must develop a data-driven approach and culture, because successfully implementing such a culture (see previous chapters) is an important factor in carrying out the Big Data analytics process. Let’s now take a look at the importance of the “data and digital platforms” duo in the sharing economy.
7.3.1. Data and platforms at the heart of the sharing economy

Because of the emphasis on sharing and collaboration, the activities of the sharing economy are often seen as very different from those provided through private and public channels. The different models in this economy have provided sources of additional or alternative income. Basically, the sharing economy is a specific business activity or a new type of business model (OECD 2016). It is significant that companies like Airbnb and Uber were founded, in 2008 and 2009 respectively, during the financial crisis. As such, this economy is portrayed as a way to deliver socioeconomic value through new forms of consumption, production, cooperation, participation, and sharing.

Indeed, the idea of sharing goods and services is not new; it is an ancient phenomenon. For example, community libraries allowed people in the eighteenth century to lend books to and borrow them from their neighbors. But the degree of sharing was very limited in the past, both because of the difficulty of matching supply and demand and because of the lack of trust between people. The emergence of the Internet and social media and the rise of digital technologies and connected objects have helped to fill these gaps. The term “sharing” has expanded (John 2017) and has become one of the fundamental cultural values of the digital environment (Stadler and Stülzl 2011). This has launched a movement to make the most of these technologies, especially digital platforms, and to put them to work for the benefit of all.

Of course, to rent out a room on Airbnb, to offer available seats on BlaBlaCar, or to become an Uber driver, you need a digital platform. To connect the owner of a resource (a good or a service) to a customer, you need a space that connects the two parties and allows both to gain an advantage: everything must be “win/win”.
For example, renting your room in Paris for a week to a person from London seems a very smart idea. Not only will the property be put to its intended use, since you’re offering a property you’re not using to someone who actually needs it, but this use will also generate additional income. For that, the person from London who wants to rent a room needs a simple tool: a digital platform that lists all available properties. Each listing must include information about the property’s defining characteristics; in other words, the exact location (the address in Paris), the price, facilities, etc.
You, in Paris, also need this platform to reach a large number of interested customers. It allows you to specify the characteristics of your apartment or property (address, price, etc.). By communicating accurate information through the platform, users will be able to find your property and communicate easily with you. In return, the property will earn you money.

We are clearly seeing a connection between supply (offering a room in Paris) and demand (looking for a room in Paris), made by providing participants with the necessary information. This connection is based on profit, but also on mutual trust, since the transaction and the consumption rely on the relationship between these people via digital platforms, where information is transparent and symmetrical. In this context, the idea of sharing requires three elements that foster a sense of social cooperation and integrity:
– mutual understanding;
– goodwill;
– honesty.

Several companies have developed their own platforms based on these three elements, allowing owners of goods and services and users of those goods and services to connect for the benefit of both. Uber, Airbnb, and BlaBlaCar are obvious examples among many other platforms, too numerous to mention. Thousands of homeowners put the characteristics of their apartments and properties online to attract users’ attention. But what do these characteristics really represent? In other words, when you create a full online reputation profile, you’re not just enabling communication and safer transactions on these platforms; you’re sharing something else. What are you sharing? The answer can only be massive amounts of data, particularly unstructured data generated by thousands of users on the sharing economy’s many platforms. Data is therefore at the center of this movement and of all its different categories (P2P, B2B, B2C, C2C, etc.).
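The supply/demand matching described above can be sketched in a few lines. The listings, their fields, and the search criteria are all invented for illustration; real platforms add ranking by reviews, availability calendars, pricing models, and much more.

```python
# Invented listings: the "supply" side that owners publish on the platform.
listings = [
    {"city": "Paris",  "price": 80,  "beds": 1, "owner": "Y"},
    {"city": "Paris",  "price": 150, "beds": 3, "owner": "Z"},
    {"city": "London", "price": 90,  "beds": 2, "owner": "W"},
]

def match(city, max_price):
    # connect demand (the search) to supply (the listings),
    # cheapest options first
    hits = [l for l in listings
            if l["city"] == city and l["price"] <= max_price]
    return sorted(hits, key=lambda l: l["price"])

# The person from London searching for a room in Paris:
for hit in match("Paris", 100):
    print(hit["owner"], hit["price"])
```

Every search, booking, and published listing that flows through such a function is also a data point, which is exactly why these platforms accumulate the massive data-sets discussed in the next section.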
Companies that do business in this economy should not only think about ways of doing business or creating new services, but also about effectively using large amounts of data to deliver goods and services to the people who want them. It’s important to see that data, like the platforms themselves, is at the heart of the sharing economy and, unlike for companies outside it, such as search engines, is directly valuable to the platforms’ fundamental activities.
7.3.2. The data of sharing economy companies

You can see from the previous sections that the emergence of the sharing economy and its different categories influences not only demand but also the entire value chain. Its impact on different parts of the value chain, including production, logistics, product design, and supply management, is becoming clear. This impact is reinforced by advances in computer technology, the Internet of Things (IoT), and blockchain.

Blockchain technology has become a key topic in the “tech” scene. This technology has the potential to provide benefits for a variety of activities and daily business processes (Gatteschi et al. 2018). Born with the Bitcoin digital currency in 2008, blockchain makes it possible to store data and carry out transactions in a decentralized and secure manner.

Box 7.3. Blockchain technology (Zahadat and Partridge 2018)
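The core idea in Box 7.3, records chained so that history cannot be altered silently, can be illustrated with a minimal hash chain. This is a sketch, not a real blockchain (there is no network, consensus, or mining); the transactions are invented.

```python
import hashlib
import json

def make_block(data, prev_hash):
    # each block stores its data, the previous block's hash,
    # and a hash over both
    block = {"data": data, "prev": prev_hash}
    block["hash"] = hashlib.sha256(
        json.dumps({"data": data, "prev": prev_hash},
                   sort_keys=True).encode()
    ).hexdigest()
    return block

def valid(chain):
    # recompute every hash and check each link to its predecessor
    for i, b in enumerate(chain):
        if b["hash"] != make_block(b["data"], b["prev"])["hash"]:
            return False
        if i > 0 and b["prev"] != chain[i - 1]["hash"]:
            return False
    return True

chain = [make_block("X rents Y's flat, 5 nights", "0")]
chain.append(make_block("X pays Y 400 EUR", chain[-1]["hash"]))
print(valid(chain))

chain[0]["data"] = "X pays Y 4 EUR"   # tamper with history
print(valid(chain))
```

Because each block's hash covers the previous block's hash, changing any past record breaks every link after it, which is what makes such a ledger attractive for decentralized transactions between strangers.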
For sharing economy companies, the data that is captured, stored, and analyzed along this chain is indicative of value creation. This means that, in order to meet the needs of owners and users of different platforms, companies like Uber, Airbnb, BlaBlaCar, etc., need to manage all of their data in real-time. The question now is: what types of data must these firms manage and analyze in order to generate value? And what are the different challenges that these companies face? To answer these questions, we need to dive back into Chapter 5 of this book, where we showed that, just as these data can have different units of measurement (they are now being measured in petabytes, exabytes, and zettabytes), they also occur in a variety of forms and can be classified into three categories (Table 7.2):
– Structured data: organized and structured; can be stored in a database and is easy to store and analyze (relational databases). Examples: databases.
– Semi-structured data: not stored in a relational database, but has organizational properties that facilitate analysis. Examples: web logs, XML.
– Unstructured data: difficult to codify and exploit; requires advanced tools and software for analysis. Examples: images, videos, data from social networks.

Table 7.2. The different types of data
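To make the semi-structured category concrete, here is a small sketch (with hypothetical field names) that flattens a nested JSON record, as a platform log entry might look, into a flat, table-ready row of the kind a relational database expects:

```python
import json

# A semi-structured record: nested fields, no fixed schema (hypothetical example)
raw = '{"user": {"name": "X", "city": "Paris"}, "listing": {"rooms": 2}, "tags": ["wifi", "metro"]}'

def flatten(record, prefix=""):
    """Recursively flatten nested dicts into a flat, 'structured' row."""
    row = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, prefix=name + "."))
        else:
            row[name] = value
    return row

row = flatten(json.loads(raw))
print(row)
# {'user.name': 'X', 'user.city': 'Paris', 'listing.rooms': 2, 'tags': ['wifi', 'metro']}
```

This is the typical first step when semi-structured platform logs are loaded into the structured, analyzable form described above.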
Data and Platforms in the Sharing Context
111
In the context of the sharing economy, we can cross-reference these three categories of data (structured, semi-structured, and unstructured) when analyzing the large quantities that are produced. But the most relevant data for companies in the sharing economy is relational data (Smichowski 2016), which reflects the way users interact on a platform. In this framework, and without neglecting the types already illustrated in Table 7.2, we distinguish three types of data:
– data provided directly by a user, such as profile information, photos, contact lists, etc.;
– data on user behavior, collected by a platform or browser;
– data generated by the analysis of the previous two types.
Managing these different types of data emanating from the sharing economy's platforms favors cooperativism, which can maximize their economic and societal effects. But in the sharing economy, the first type of data (provided by a user) is highly relational. That is to say that if, for example, a person X (from London) rents the house of another person Y (in Paris) using a platform, this data refers to both persons. The same applies to some of the data provided, such as photos in which several people appear, or a contact list. This poses a problem, because almost every individual concerned must agree that the data can be used. And the context of Big Data involves far more than these two persons (X and Y): in fact, Big Data analytics is only useful when large amounts of data are analyzed. What counts is the use of data involving millions or even billions of units of different types. Many different people use sharing platforms, each of which gathers large amounts of personal data about consumers, potentially creating challenges for protecting this data.
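The consent problem raised by relational data can be made concrete with a small sketch (the record structure is hypothetical): a record is only usable if every person it refers to has agreed.

```python
# Hypothetical relational records: each piece of shared data refers to
# one or more persons (a rental links X and Y; a photo may show several people).
records = [
    {"id": "rental-1", "refers_to": {"X", "Y"}},
    {"id": "photo-7", "refers_to": {"X", "Y", "Z"}},
    {"id": "profile-X", "refers_to": {"X"}},
]
consented = {"X", "Y"}  # users who agreed to let their data be used

def usable(record, consented):
    """A record is usable only if all persons it refers to have consented."""
    return record["refers_to"] <= consented

print([r["id"] for r in records if usable(r, consented)])
# ['rental-1', 'profile-X']
```

The photo showing Z cannot be used without Z's consent, even though X and Y both agreed; at the scale of millions of records, this check becomes a real operational constraint.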
In the case of data-driven services, as provided by most sharing economy platforms, the challenge lies not only in the sharing itself but in the data: in gathering a sufficient amount of it and in knowing how to exploit it.

7.3.3. Privacy and data security in a sharing economy

We can agree that when we use platforms such as Airbnb, Uber, Lyft, BlaBlaCar, and others, we are not only offering or sharing our property, or looking for specific goods or services; we are also sharing our personal data.
112
Sharing Economy and Big Data Analytics
Yes! Our apartments' addresses, the facilities available, our credit card information when we make transactions, our destinations, our availability: we communicate all of this data on sharing economy platforms. Are you aware that we are providing this data to other users via companies' platforms that own no vehicles and have not recruited a single driver (Uber, BlaBlaCar), and that own no houses, apartments, or hotel rooms (Airbnb)? These companies do not own the goods and services that we share through their platforms. Each platform is designed for a specific purpose (driving, accommodation, self-employment) connecting an X to a Y. Via these platforms, X and Y will:
– create profiles;
– describe their property (goods or services);
– search in lists based on predefined criteria;
– book rooms or apartments in advance;
– make payments safely.
These platforms are therefore digital spaces where we can find or develop several functions at once, all while providing the necessary data. From an ethical point of view, each of these functions poses one or more problems related to the protection of participants' privacy. Although these platforms allow users easy access to the sharing of goods and services with a single click on their connected devices and objects, it is important to examine some questions and requirements related to data security and confidentiality that seem to be neglected. The different types of data described previously must be protected in the context of sharing. To explain this, let's return to our example: X, who has a room in Paris, and Y, who is looking for one in the neighborhood. In this case, X will first create a profile, on the Airbnb platform for example, which includes some basic details such as name, username, etc., which must be visible to all. This type of data can be stored in a distributed storage system in unencrypted format.
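The split between data that may sit in unencrypted, publicly visible storage and data that must stay private can be sketched as follows. This is a toy illustration with hypothetical field names, not how any particular platform models its data:

```python
from dataclasses import dataclass, field, fields

# Hypothetical sketch: which profile/listing fields a sharing platform
# might expose publicly and which must remain private.
@dataclass
class Listing:
    username: str = field(metadata={"public": True})
    description: str = field(metadata={"public": True})
    photos: list = field(metadata={"public": True})
    exact_address: str = field(metadata={"public": False})
    door_code: str = field(metadata={"public": False})
    card_number: str = field(metadata={"public": False})

def public_view(listing):
    """Return only the fields flagged as publicly visible."""
    return {f.name: getattr(listing, f.name)
            for f in fields(listing) if f.metadata.get("public")}

listing = Listing("X", "2-room flat near the metro", ["photo1.jpg"],
                  "12 Rue Exemple, Paris", "48B2", "4111 xxxx xxxx xxxx")
print(public_view(listing))
# {'username': 'X', 'description': '2-room flat near the metro', 'photos': ['photo1.jpg']}
```

Making the public/private flag explicit in the data model, rather than deciding it ad hoc at display time, is one simple way to keep the address, door code, and payment details out of any publicly served view.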
To describe his or her property, X must add additional information to the profile that will be publicly visible, such as the description of his or her apartment, photos, etc.
This description contains data related to the apartment's address: the neighborhood where the apartment is located, its proximity to public transportation, supermarkets, tourist attractions, parking, etc. This information is very helpful to potential client Y, who can easily access these details. X can also set the apartment's availability. But Y has no way to modify this information (the apartment's availability), which can only be managed by X. X must keep it up to date in order to approve new requests from other clients. Notice also that X and Y can communicate securely in private. Once Y confirms the reservation, this decision is shared only with X. The payment details between the two parties remain anonymous and must be fully secured. Obviously, this security also concerns Y's trip, because he or she does not want this information to be visible to other people. However, some metadata related to booking and payment can be published, for example to show that X has actually hosted the number of guests he or she claims. On the other hand, Y will have temporary access to data about the apartment (the exact address, door codes, etc.). After the stay, Y receives a notification asking for his or her opinion, which future clients will be able to see. This example shows that there is a range of data privacy requirements on sharing platforms. This concerns not only the Airbnb platform, but all sharing economy platforms. The example also allows us to see that the following two statements highlight a dual aspect (Ranzini et al. 2017):
– The first concerns the security of the data exchanged between users and the companies that created these platforms. In return for the security of their data, these users are able to participate on these platforms. It should be noted here that these platforms are reluctant to share their users' data, which would be important for better estimating the impact of the sharing economy (Frenken and Schor 2017).
This reluctance can harm the platforms themselves, by limiting the potential size of the sharing economy.
– The second is related to the security of the data that users exchange with each other in order to access goods or services.
The creation of sharing platforms thus raises privacy protection issues, to the extent that they involve not only the sharing of goods and services but also the sharing of data through the simple use of the platforms. The emergence of this economy has raised challenges concerning the private lives of participants and the sharing of their data. It would take a whole book, or even a series of books, to analyze these points in detail. But we wanted,
through this short discussion, to draw your attention to confidentiality and data security, two elements to be considered in the context of the sharing economy.

7.3.4. Open Data and platform data sharing

In the context of Big Data, the challenges faced by businesses in the sharing economy extend beyond privacy and security. Other challenges, such as problems related to the complexity, scalability, heterogeneity, quality, and timeliness of data, are also a major concern (Table 7.3). These problems should be taken seriously by companies in the sharing economy when analyzing massive data-sets and developing their analytical process, in advance, even before building a data-driven approach. But beyond all of that, another very interesting point we wanted to look at here is the sharing of, and ease of access to, the large data-sets that are generated. We believe that, contrary to what its name suggests, the sharing economy, or, more precisely, this economy's platforms, do not consider the value of sharing their users' data.

Table 7.3 summarizes the main challenges and the problems associated with each:
– Complexity: large amounts of data emanate from different sources and are produced in different forms in real-time.
– Scalability: large-scale data whose volume keeps growing.
– Heterogeneity: the presence of mixed data based on different models or rules (a heterogeneous mix of data); the properties of the models vary widely; data can be both structured and unstructured, very dynamic, and without a particular format.
– Quality: more data does not always mean that it is the right data.
– Timeliness: the size of the data-sets to be processed increases in real-time, while the results of the analysis are required immediately.

Table 7.3. Challenges associated with Big Data
Collaborating and making data open will result in greater value for all stakeholders. But, unfortunately, this is not the case.
We need data to build trust with other users and to improve our reputations. Enhancing this reputation can be presented as a factor motivating users to interact in a community (Wasko and Faraj 2005) and as a commitment based on trust between users (Utz et al. 2012). In other words, we need to communicate with trustworthy people, because even if the platform allows us to connect with other users, no one is willing to connect with someone who cannot be trusted. Since data quality depends on the amount of available data, the user needs to access data from other users of the platform. By connecting to these virtual spaces where billions of people cooperate and work independently (Rifkin 2014), we transmit large amounts of data. Harnessing this data can increase the benefits of the sharing economy, and users and participants can improve their reputations by sharing information. It is clear that the economic benefit of the sharing economy lies in the efficient use of resources, resulting in reduced transaction costs (Lobel 2018). But sharing data would provide access to large data-sets, which would open up many opportunities to improve and operationalize various practices in the sharing economy and to create new business models. To exploit the potential of large data-sets, data must be shared, and its reuse must be allowed. Building a data-driven culture depends on easy access to this data as a value-generating resource. However, as mentioned previously, companies always seem reluctant to share their data. One of the main objections raised by private platforms is that their competitors could access their data, because there is no single platform supporting transparent data sharing. Therefore, the key question is how this sharing can be encouraged. Maybe you'll think of something that would boost the culture of data sharing: for example, helping businesses understand the value of their data, or creating a platform for sharing data-sets in a transparent manner.
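To see how reputation scores of the kind discussed above might be computed, here is a minimal Bayesian-average sketch. This is a common technique for rating systems in general, not necessarily what any particular platform uses, and the prior values are illustrative assumptions:

```python
def bayesian_reputation(ratings, prior_mean=3.0, prior_weight=5):
    """Smooth a user's average rating toward a global prior, so that a user
    with very few ratings is not ranked above well-established users.
    prior_weight acts like that many 'virtual' average ratings."""
    n = len(ratings)
    if n == 0:
        return prior_mean
    return (prior_weight * prior_mean + sum(ratings)) / (prior_weight + n)

# A single 5-star rating barely moves a newcomer above the prior ...
print(round(bayesian_reputation([5.0]), 2))       # 3.33
# ... while a consistent history earns a score close to the raw mean.
print(round(bayesian_reputation([5.0] * 40), 2))  # 4.78
```

The design choice matters for trust: a raw average would rank a stranger with one 5-star review above a host with hundreds of 4.8-star stays, which is exactly the kind of signal users cannot rely on.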
But the more reasonable approach is to focus on the platforms. The different sharing economy platforms are far from simple mechanisms for data access: they represent a third party that enables a systematic exchange of the data flowing between users. As we noted at the beginning of this chapter, these platforms, like data, lie at the heart of the sharing economy's various practices. They are the foundation of users' activities, giving them access in exchange for the data they provide (Srnicek 2017).
It is therefore time to rethink data sharing and openness for effective reuse in the economy in general. This means that techniques and algorithms for data analysis are essential for better leveraging the flow of data. In this context, companies like Uber, Airbnb, BlaBlaCar, Lyft, and others must follow an approach based not only on data but also on making it available, or Open Data. This is very important for drawing more benefit from third-party services (platforms) and from data reuse.

7.4. Conclusion

The new sharing-based economy is transforming businesses and IT in all sectors, and the results are visible to all of us in our daily lives, whether we're using Uber for city travel, Airbnb to book our next vacation, or BlaBlaCar to find a rideshare. The idea behind these platforms is not only to allow for a quick connection to anyone around the world, but also to gain potential benefits. The distinct characteristics of each platform and their various abilities to execute a task (finding the best match, suggesting the users one wishes to contact, predicting our needs, etc.) are the results of data analysis. This is explained by the technical properties of the use of Big Data. Do you now understand the importance and power of this duo, "data and platforms"? Let's now summarize the key points discussed in this chapter.

TO REMEMBER.– This chapter provided the opportunity to learn that:
– technological development and the emergence of platforms have facilitated communication and interceded between the owners of goods and services and their clients;
– data is at the heart of the sharing economy, and Analytics is the tool that ensures smooth operations and leads to the creation of value;
– Big Data allows companies to better guide the decision-making process and to operationalize various activities;
– large companies have experimented with and developed various applications and solutions using Big Data and the power of Analytics;
– the data produced in the context of the sharing economy takes different forms (text, photos, etc.) and is both structured and unstructured;
– the proper application of data analysis for sharing platforms depends on the quality and capacity to provide personal data security.
Collaborating and making data open will result in greater value for all stakeholders.
8 Big Data Analytics Applied to the Sharing Economy
Things only get done if the data we gather can inform and inspire those in a position to make a difference. Mike Schmoker
8.1. Introduction The effectiveness of an approach based on the analysis of large data-sets must be evaluated based on the reason for its use (the why?). The biggest question for Big Data, as with any major innovation, lies in the value that this phenomenon is likely to generate (the what?) and in its implementation (how? and for whom?). If we cannot identify ways to understand this approach (how?) and its benefits (for whom?), it will be difficult to understand the importance of Big Data analytics for the sharing economy, who should make it happen, and why it is so important. There was a time when Big Data was considered a buzzword that appeared on the covers of many books, magazines, reports, special issues of journals, and conferences throughout the world, as if to signal an increased need for processing and analyzing growing volumes of data. For 10 years or more, the scientific community has not stopped analyzing the deluge of data and exploring its business value, while discussing its various opportunities and challenges. A number of companies have understood its importance and are committed to the exploitation of its potential, like the pioneers
Sharing Economy and Big Data Analytics, First Edition. Soraya Sedkaoui and Mounia Khelfaoui. © ISTE Ltd 2020. Published by ISTE Ltd and John Wiley & Sons, Inc.
who showed the way to the effective use of Big Data and Analytics. But these uses are not restricted to those companies, and it would be an error to think that they are. Many companies, regardless of industry or size, can benefit from and take advantage of Big Data applications. According to studies and reports published by McKinsey, Cisco, PwC, and others, the analysis of large amounts of data provides greater value to businesses by reducing costs and generating innovative ideas. This is why the data-driven approach has in recent years been the subject of major hype.

This approach has led to the emergence of a number of business models, each of which has created its own way of doing business. This is the case, for example, for start-ups and entrepreneurs who noticed that a number of assets (goods and services) are not effectively exploited by the parties that hold them. Many of the products we buy are only used for a certain period of time and are then set aside. What if we could find one or more people who might need them? This is the question that these innovative business models have considered. They were able to see the potential financial benefit in these resources, simply by facilitating their sharing. This fascinating idea led to the emergence of the sharing economy.

Frederic Mazzella of BlaBlaCar, for example, decided to make carpooling more accessible by creating a solution that allows drivers to find passengers. The idea dates back to 2003, when he struggled to get home for Christmas: all trains from Paris to the Vendée (where his hometown is located) were full. Ultimately, his sister made a detour to drive him there. During the trip, the young programmer realized that plenty of seats were available, not on the trains, but in hundreds of cars. All one had to do was to look for a car going in the same direction and to share the costs, instead of looking for a seat on the train.
Mazzella’s idea is only one example among many others, such as Uber and Airbnb, which have shown us that there is a solution for every problem. We just have to find new approaches to solving them. But one thing is clear: the ideas created by BlaBlaCar, Uber, Airbnb, etc. cannot be implemented without determining what has made their development possible (how?), and of course, the target audience (for whom?). These companies use data to determine what to develop and whom to target in order to create opportunities in untapped markets for sharing.
Several studies have noted the potential of the large amounts of data generated and collected by sharing platforms. The analysis of this data makes it possible not only to promote the performance of these business models and to operationalize their activities (the micro level), but also to predict economic outcomes (the macro level) such as inflation, unemployment, housing prices, etc. (Einav and Levin 2014; Wu and Brynjolfsson 2015; Glaeser et al. 2017). This chapter will be an opportunity to learn about a variety of analysis techniques and algorithms, including Machine Learning algorithms, and to explain how they can be applied in the context of sharing. Through this chapter, we will see how sharing economy businesses use Big Data to generate value. But first, we will discuss the different processing and visualization technologies that these companies can use.

8.2. Big Data and Machine Learning algorithms serving the sharing economy

The sharing economy has transformed the way participants and users communicate and connect with each other throughout the sharing process. But behind this process lie a variety of situations and more or less complicated steps that are applicable in different ways. Given their changing nature, it is very difficult to understand sharing practices: sharing as we know it today is completely different from how it was practiced just a few years ago, and it will undoubtedly change again tomorrow. Indeed, in recent years we have witnessed the spread of highly advanced methods and IT tools that were previously available only to large companies. This has facilitated access to many ways of creating innovative ideas. Some succeed by disrupting value chains and seriously upsetting established players: Uber for taxis, BlaBlaCar for intercity carpooling, Airbnb for hotels, etc. And this is, of course, just the beginning, because the trend is accelerating.
The companies of 2019 are organized so differently from those of 2009, which, in turn, worked differently from companies in 1999, etc. This paradigm shift involves new modes of interaction between actors, which leads to a re-examination of the fundamentals of traditional channels and an understanding of new forms of collaboration. Thus, the digital revolution has led to a data revolution, which gave companies the power to collect large data-sets. The defining characteristic of this trend therefore concerns Big Data.
All sharing economy platforms rely on data analysis to develop practices and to determine whom to target. In addition, these platforms analyze this data to create custom goods and services. This data is increasingly used today because of the combination of a number of factors that we have already discussed in previous chapters (Sedkaoui 2018a):
– continuously decreasing data storage costs;
– increasing computing power;
– the explosion of large amounts of data, largely unstructured in nature, which requires operating techniques different from traditional analysis methods.
Big Data and the sharing economy are the two areas that have most marked the digital ecosystem. The analysis of large volumes of data has allowed sharing economy companies to launch many applications for users. The use of platforms by these users leads in turn to increasing amounts of data. This requires more advanced data analysis techniques, opening the way for Machine Learning algorithms, which adapt to the different characteristics of Big Data and are increasingly effective at identifying patterns and extracting knowledge.

8.2.1. Machine Learning algorithms

The ability to generate value in the sharing economy context and to make Big Data more profitable depends largely on the ability of businesses in this economy to analyze the available data. But how can this be done? The answer is: Machine Learning algorithms. Why? Because to generate value you need data, but you also need techniques that allow you to make better use of it. These techniques are algorithms that perform the analysis and give value to your data. As a result, Big Data is the very essence of these algorithms, while the latter are the means by which to fully exploit its potential. Machine Learning algorithms can learn from large amounts of data or from actual observations (Sedkaoui 2018a).
As this data comes from different sources in different forms, its analysis can be of varying levels of complexity. This is why these algorithms give a machine the ability to learn without being explicitly programmed (Samuel 1959). What? Does this seem hard to grasp? Do you need more clarity?
Okay: think about an example closer to our daily life: a spam filter. Machine Learning first analyzes how to classify our emails by deducing specific criteria (words, concepts, symbols, etc.), which the algorithm then uses to classify each mail as spam or not. The spam filter is one of the first applications of Machine Learning algorithms, and it is embedded in most current email applications (Gutierrez 2015). This is also the case for the analysis of website content: when you start a search, search engines can identify the most important words and phrases in order to return the most relevant results (Witten et al. 2016). So, we can deduce from these two examples that Machine Learning algorithms are designed to extract a degree of regularity that makes learning possible. This extraction is mainly related to the different stages of the Big Data analytics process that we discussed in more detail in Chapter 6. Given that various Machine Learning algorithms are useful throughout the data analysis process, it is not surprising that a considerable number of technologies have been developed to simplify their application. We will return to this point later in this chapter. For now, we will look at the various Machine Learning algorithms that we can use to extract value. In this context, these algorithms fall into two main categories:
– supervised algorithms: learn and achieve results by trying to find patterns in a set of known, labeled data. This is the case, for example, for spam filters;
– unsupervised algorithms: here, the input data have no known class structure, and the task of the algorithm is to reveal the data's structure. Take, for example, a new Amazon user who is looking for a specific product. Amazon's system doesn't have any data about this user, so the user will be associated with a group of clients who are looking for the same product. The algorithm therefore categorizes the user based on similarities in the data.
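The spam-filter example above can be sketched as a tiny Naïve Bayes word classifier. The training messages are toy, made-up data; real filters work the same way, just with millions of messages and richer features:

```python
import math
from collections import Counter

# Toy training data for the spam-filter example (made-up messages)
train = [("win money now", "spam"), ("cheap money offer", "spam"),
         ("meeting tomorrow morning", "ham"), ("lunch tomorrow", "ham")]

word_counts = {"spam": Counter(), "ham": Counter()}
label_counts = Counter()
for text, label in train:
    label_counts[label] += 1
    word_counts[label].update(text.lower().split())

vocab = set(word_counts["spam"]) | set(word_counts["ham"])

def classify(text):
    """Naive Bayes: pick the class maximizing
    log P(class) + sum of log P(word | class), with add-one smoothing."""
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / sum(label_counts.values()))
        total = sum(word_counts[label].values())
        for word in text.lower().split():
            score += math.log((word_counts[label][word] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(classify("win cheap money"))   # spam
print(classify("meeting at lunch"))  # ham
```

The "criteria" the text mentions are exactly the per-word probabilities the model learns from labeled examples, which is what makes this a supervised algorithm.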
Whatever type of algorithm you adopt for analyzing large amounts of data, you should know that algorithms are not all designed for the same purpose. They are generally classified into two types based on:
– their mode of learning;
– the type of problem to be analyzed.
This is what we will discuss in the next point.
8.2.2. Algorithmic applications in the sharing economy context

There are a large number of algorithms in these two categories that we can use to analyze Big Data; the most common are listed in Table 8.1. Regression and classification are of paramount importance in the data analysis process and appear to be the most widely used Machine Learning algorithms. Their uses are varied. For example, we opt for classification when we want to look for words (people, places, etc.) in a text, or when we try to identify people in pictures or voice recordings. Regression, for its part, is used to estimate how much people are willing to spend in exchange for a good or service, or the number of customers who may be interested in specific resources; it can help Uber, for example, to predict which car parts are likely to fail, to prevent payment fraud, or to create a robust ratings system.

Simple analysis:
– Simple regression (supervised): regression;
– Multiple regression (supervised): regression;
– Naïve Bayes (supervised): classification;
– Logistic regression (supervised): classification;
– Hierarchical classification (unsupervised): cluster analysis;
– K-means (unsupervised): cluster analysis.
Complex analysis:
– Decision tree (supervised): classification/regression;
– Random Forest (supervised): classification/regression;
– Bootstrap (supervised): classification/regression;
– Support Vector Machine (SVM) (supervised): classification/regression;
– Neural Networks (supervised): classification/regression;
– kNN (supervised): classification/regression.

Table 8.1. The various applications of Machine Learning algorithms
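As a concrete instance of the simple regression listed in Table 8.1, here is a minimal least-squares sketch with made-up numbers (apartment surface area versus nightly price):

```python
# Simple one-variable least-squares regression.
# Hypothetical data: apartment surface (m^2) vs. nightly price (euros).
xs = [20, 35, 50, 65, 80]
ys = [45, 70, 95, 120, 150]

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n

# slope = covariance(x, y) / variance(x); the intercept fits the means.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

def predict(surface):
    """Estimated nightly price for a given surface area."""
    return intercept + slope * surface

print(round(predict(40), 1))  # 78.7
```

This is the same mechanism, fitted over millions of listings and many more variables, that lets a platform suggest a price for a new apartment before it has any booking history.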
These two methods, regression and classification, can be coupled together and applied to identify potential customers for apartments or houses on Airbnb or to look for tourist destinations near a specific location. These are very important techniques, but the list of Big Data analytics applications is not limited to these two algorithms.
There is also cluster analysis, another family of algorithms that we can use to analyze data from sharing economy platforms. This type of analysis can be used to study the attitudes of sharing economy platform users, for example by associating socio-demographic factors with other variables related to sharing in order to find groups (or clusters) based on these factors. It can also be applied where we seek to assess the motivations of users based on their sharing behavior, for example in the case of peer-to-peer platforms. This list of applications for (supervised and unsupervised) Machine Learning algorithms can only be seen as a taste of what's to come, because they are everywhere in the sharing economy's various practices. We believe that sentiment analysis algorithms are of particular importance, because they allow sharing economy companies to efficiently automate user feedback platforms by creating graphics for assessing user satisfaction. We will come back to these different algorithms in Part 3 of this book, where we will work with examples from Airbnb's databases. These examples will help you to understand the advantage of Big Data analytics in the context of this economy. For now, you need to be familiar with the different automated solutions and Big Data technologies that facilitate the use of data and simplify the analysis process.

8.3. Big Data technologies: the sharing economy companies' toolbox

To fully exploit the potential of Big Data, companies need powerful tools to process, analyze, and store the quantities of data they produce and collect daily. Before the emergence of the 3 Vs phenomenon, companies depended on traditional analysis techniques. In the context of Big Data, however, these techniques are less effective for extracting knowledge from huge volumes of data, because they react more slowly and lack scalability, performance, and accuracy.
Much work has been done to meet the complex challenges of Big Data and to exploit the power of data. Various types of tools and technologies have been developed to streamline the data analysis process and create an ideal environment for the application of different Machine Learning algorithms. Hadoop, Spark, Python, Matlab, R, and other tools and features for managing and analyzing big data offer a variety of algorithmic applications.
Given its importance, the data analysis process in the Big Data era, which begins, as we have seen, with the definition of a goal and involves collection, preparation, modeling, and finally action, is based on these technologies. This process requires advanced tools, and it begins with choosing the tools needed to ensure the effective use of data. This choice can be guided by three criteria, each corresponding to different needs and levels (Sedkaoui 2018b).

If you're just starting to analyze data relating to your business, your start-up, or your e-commerce site, for example, you can use "tools for beginners" such as Google Analytics, Google Tag Manager (GTM), Regex101, Excel, etc.

If you have a larger budget for your Big Data analysis project and want to analyze your data in more detail, you should use advanced tools, or "standard tools". These can address more specific needs than the tools previously mentioned: Optimizely, Dataiku DSS, Crazy Egg, Mixpanel, etc.

On the other hand, if you have a team dedicated to analyzing Big Data and want to use your data for very specific purposes beyond a certain level of analysis, it is best to create your own tools (scripts, pipelines, automation, etc.) with "technology for experts". This requires delving into the world of coding and programming. Among the standard tools for data analysis at this level, we should mention Hadoop, MapReduce, Spark, Python, etc.

When you read the word "coding", you may imagine that it's very complicated and too technical for you. But know that there is a first time for everything, and that you need to stay informed about technological advances in the context of Big Data. If you're just beginning in this field, we suggest you start by familiarizing yourself with the first group of tools (beginner level).
The passage from "beginner" to a higher level, "standard" or "expert", will happen gradually, as your needs become more defined or as your databases grow. In the sharing economy context, the importance of Big Data depends not on the amount of data that businesses hold but on what they can do with it. They can use and analyze data to reduce costs and time, to develop new services and optimized offers, to make intelligent decisions, etc. The most important thing is to know what they want to do with their data, because this ultimately determines the choice of technology. It's time to look at the most powerful tools that facilitate the analysis of data. This is, moreover, the purpose of this section: to provide a mini-encyclopedia of the various Big Data technologies that companies in the sharing economy can adopt.
Big Data Analytics Applied to the Sharing Economy
127
We will review the different technologies that allow these companies to analyze and question their data.

8.3.1. The appearance of a new concept and the creation of new technologies

Many researchers point out that Big Data cannot be analyzed by relying on traditional methods (Chen and Liu 2014), and that the emergence of this phenomenon has required the development of a set of IT tools with a new form of integration, making it possible to take advantage of data that is available in various forms and produced in real time. All the power of Big Data therefore rests on technology.

As Big Data continues to grow around the world, more and more companies are becoming interested in the related technologies for analyzing their data and better managing their strategies. It is the giants of the web, as mentioned in the previous chapter – Google, Amazon, and Facebook – who are behind the development of these new tools. Technologies such as Hadoop, MapReduce, and Apache Spark have become standard tools for working with such data. And with Hive, it is possible to manipulate data on Hadoop via SQL queries.

The list of technologies that offer innovative solutions for leveraging Big Data is long. But what are the most popular technologies that companies can use, and which ones provide the best storage, exploration, analysis, and data visualization?

8.3.1.1. The Hadoop ecosystem

The first and most popular solution is probably Hadoop and its ecosystem. Originally developed at Yahoo! and now supported by the Apache Foundation for the creation of applications that can store and process large amounts of structured and unstructured data, Hadoop is the most widely used framework for analyzing Big Data. It is an open source technology specifically designed to store very large (structured and unstructured) data-sets. Hadoop combines a distributed file system (HDFS) with the MapReduce processing engine, and it is supported by an ecosystem that extends its functional scope.
For example, HBase (a NoSQL database) or Hive (a data warehouse with an SQL-like query language). Specifically, Hadoop includes a data storage feature, the Hadoop Distributed File System (HDFS), and a feature for information processing:
MapReduce. Hadoop is thus composed of several components: the HDFS storage system, the YARN job scheduling system, and the MapReduce processing framework.

8.3.1.2. Apache Spark

Apache Spark is part of the Hadoop ecosystem, but its use has become so widespread that it deserves a category of its own. Apache Spark is an analysis tool for large amounts of data that works with all major programming languages, including Java, Python, R, and SQL. It is also very flexible and works with Hadoop, for which it was originally developed. The tool provides integrated functionality for real-time data analysis, SQL, Machine Learning, graph processing, and much more.

Apache Spark is optimized to operate in memory and also enables continuous interactive analysis: unlike pure batch processing, large amounts of historical data can be analyzed alongside fresh data to make real-time decisions. This well-known analysis platform guarantees fast, easy-to-use, and flexible computing. Indeed, Spark runs programs faster than Hive and standard Hadoop MapReduce. The tool is widely used by companies of all sizes, from the smallest start-ups to technology giants such as Apple, Facebook, and IBM. It is very useful for predictive analytics, fraud detection, sentiment analysis, etc.

8.3.1.3. NoSQL databases

Traditional relational database management systems (RDBMS) are used to manage selected business data, but they are not able to store large-scale data with rapid processing. NoSQL ("not only SQL") databases offer a new approach to data storage that is more flexible, more scalable, and less vulnerable to system failures. NoSQL databases specialize in the storage of unstructured data and provide fast performance. The most popular NoSQL databases include MongoDB, Redis, Cassandra, HBase, and many others.

8.3.1.4. In-memory databases

In any IT system, memory (also called RAM) is much faster than long-term storage.
If a large data analysis solution can process data stored in memory rather than data stored on a hard drive, its performance will be significantly faster. And that’s exactly what in-memory database technology does.
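The idea can be illustrated with SQLite, which ships with Python and supports a purely in-memory mode. This is a minimal sketch of our own, not one of the vendor products discussed in this section, and the trip data is invented:

```python
import sqlite3

# Open a database that lives entirely in RAM: no file, no disk I/O.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (city TEXT, fare REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [("Paris", 12.5), ("Paris", 8.0), ("London", 15.0)],
)

# Aggregations run entirely against memory-resident pages.
rows = conn.execute(
    "SELECT city, AVG(fare) FROM trips GROUP BY city ORDER BY city"
).fetchall()
print(rows)  # [('London', 15.0), ('Paris', 10.25)]
```

The same query against an on-disk file would pay the cost of disk reads; the in-memory engines named below apply this principle at enterprise scale.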
Many of the major enterprise software vendors, including SAP, Oracle, Microsoft, and IBM, now offer in-memory database technology. In addition, several smaller companies such as Teradata, Tableau, VoltDB, and DataStax offer in-memory database solutions.

8.3.1.5. Keep in mind

Several other solutions can also be implemented to optimize data processing, speed up query response times, increase security, etc. With all these solutions, illustrated in Table 8.2, we can now learn more about Big Data and its ecosystem by examining the problems faced by companies in their projects. These technologies also include:

– real-time processing solutions: these are solutions that improve processing times. They are often used as the basis for the implementation of scalable solutions. Indeed, with this method, it is not necessary to wait for data processing to be completed before accessing the results;

– data security solutions: more and more businesses have data security concerns. According to the Big Data Executive Survey 2017 (NVP 2017) report, the most popular technology solutions for data security include identity and access controls, encryption, and data separation;

– Data storage: MongoDB, ElasticSearch, Flooring, Redis, Memcached, RainStor, Voldemort, Accumulo, HBase, Hypertable, Cassandra, Neo4j, blockchain, etc.
– Data analysis: IBM SPSS Modeler, KNIME, RapidMiner, Apache Spark, Presto, SAP HANA, Blackboard, Splunk Hunk, etc.
– Data integration: Flume, Sqoop, NiFi, Storm, Flink, Scribe, Chukwa, etc.
– Data security: Twill, Apache Ranger, Hama, blockchain, etc.
– Distributed programming: MapReduce, Pig, Samza, Kudu, JAQL, Spark, PigGen, Sentry, Ranger, etc.
– Machine Learning: Mahout, Weka, Onyx, H2O, Sparkling Water, MADlib, Spark, Python, R, Julia, etc.
– Other: Thrift, ZooKeeper, Tika, GraphBuilder, Oozie, Falcon, Mesos, Hue, Ambari, etc.

Table 8.2. Big Data technologies by field
– Machine Learning: as we described earlier in this chapter, Machine Learning is also a technology that uses algorithms to enable the analysis and use of Big Data;
– blockchain technology: this distributed data technology was first popularized by the Bitcoin digital currency. Its uniqueness lies in the fact that once data is entered, it cannot be deleted or changed. This makes it a highly secure solution and an excellent choice for Big Data applications, especially in banking, insurance, healthcare, retail, etc.

As you can see, Big Data technologies are diverse and constantly changing. These solutions enable the analysis of large volumes of data and the extraction of useful results. These results become more understandable when we can transform them and present them in visual form. As William Playfair already noted in 1780: "... making an appeal to the eye when proportion and magnitude are concerned, is the best and readiest method of conveying a distinct idea." This quote illustrates what, in the Big Data universe, we call data visualization.

8.4. Big Data on the agenda of sharing economy companies

In recent years, there has probably been no better example of Schumpeterian innovation than Uber, Airbnb, BlaBlaCar, and their peers, which defy long-standing traditions in different sectors with their platforms. Recent technological developments (mobile technology, autonomous driving, artificial intelligence, blockchain, etc.) have enabled many innovations in mobility services, transportation, logistics, accommodations, recruiting, etc. The successful launch of the sharing economy has therefore redefined the landscape in these sectors.

A variety of intelligently designed alternatives, heavily based on trust, have been identified as opportunities by many people looking for an alternative source of income. Overall, this new wave of innovation provides the technical support and basic economic culture necessary for solving problems and creating value, and it also provides guarantees of basic cooperation and reciprocity for economic actors.
This new wave is based on the phenomenon of Big Data and the opportunities generated by analytical practices. Without data analysis, companies cannot meet the needs of their customers in real time. Identifying resources, targeting the people who need them, informing those who hold them, and facilitating communication between the various participating parties is a huge task. Sharing economy companies therefore need Big Data analytics to see market trends and the various opportunities created by sharing. Without real-time analysis
of large amounts of data, these models cannot deliver the value promised by the context of sharing. Many companies have realized how important this is and have oriented themselves towards building a culture that promotes sharing and that is based on and directed by data (data-driven). Being a data-driven company means that a continuous, two-way conversation is happening between data and the company's strategy. This is what we will examine in the following section: how have these various models used and benefited from Big Data analytics?

8.4.1. Uber

We cannot talk about examples of sharing economy companies without citing the case of Uber. This American company, founded in 2009, has undoubtedly become a "Big Data company". After 10 years of existence, the company is ranked at the top of the global transport sector. The various Big Data applications used by Uber teach us much about the effective use of data and its analysis. How does Uber use its data, and what can we learn from these uses?

At Uber, the combination of Big Data technologies and Machine Learning algorithms has made it possible to analyze different types of data in order to understand the behavior, locations, and preferences of its customers, and to manage the availability and positioning of drivers more effectively. When you request an Uber, algorithms get to work and, within seconds, they put you in touch with a matching driver (the one closest to you). Behind the scenes, Uber stores a huge amount of data about every trip taken. Algorithms then use this data to predict supply, demand, and pricing. Furthermore, these algorithms also examine how routes are managed in different cities in order to adapt them. Yes, it's amazing, but true. With Big Data applications, Uber can even investigate its customers' complaints, for example in the case of speeding. Uber represents a successful model of working with Big Data analytics.
This is not only a question of capturing large amounts of data; its success is primarily based on its capacity to collect relevant data to connect supply and demand for services. Who needs a car and where? It is by focusing on these two pieces of data that Uber has managed to make taxis obsolete. Uber needed to know exactly where potential customers are to automate the decision-making process when sending drivers.
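The core of this dispatch decision – "who needs a car and where?" – reduces to a nearest-neighbor search over driver positions. The following is a simplified sketch of ours, not Uber's actual system, and the coordinates and driver IDs are invented for illustration:

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (lat, lon) points, in kilometers.
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))

def closest_driver(rider, drivers):
    # Pick the driver with the smallest great-circle distance to the rider.
    return min(drivers, key=lambda d: haversine_km(*rider, d["lat"], d["lon"]))

rider = (48.8566, 2.3522)  # hypothetical rider position (central Paris)
drivers = [
    {"id": "d1", "lat": 48.8606, "lon": 2.3376},
    {"id": "d2", "lat": 48.8738, "lon": 2.2950},
]
print(closest_driver(rider, drivers)["id"])  # d1 (roughly 1 km away)
```

A production dispatcher would of course add traffic-aware travel times, driver availability, and spatial indexing, but the decision being automated is this one.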
In addition, Uber has developed algorithms to monitor traffic conditions and travel times in real time (Marr 2016; Sedkaoui 2018b). Consequently, prices can be adjusted according to demand and travel times. This pricing method, based on the analysis of large amounts of data, has been patented by Uber. It is called surge pricing, an implementation of "dynamic pricing" already used by airlines and hotel chains to adjust prices to demand in real time through predictive analysis. The challenge is to guide drivers by showing them where the best opportunities are: knowing where there are more trips makes it possible to build a value map of cities.

This success has not escaped the transport market leader. In 2017, the American company launched Uber Movement, a portal that allows cities to register and receive personalized statistics on traffic in their streets, derived from anonymized data transferred by drivers' and customers' smartphones (Sedkaoui 2018b).

Uber's data-driven approach allows it to expand its market and to operate in over 450 cities around the world. This approach also extends to data visualization. Uber seeks to optimize urban travel time in major cities. Initially, the company was interested in setting up a demand-based pricing system (the Geoserve system, based on the level of supply and demand depending on location). The data from this system was used to optimize the movement of drivers in order to reach a new understanding of mobility: it can reveal urban rhythms on a real map of the city, as a function of sectors and traffic schedules. The company has opted for a "multi-cloud" strategy, and its architecture combines many development languages, large databases, advanced technologies, etc.

You may be thinking that Uber uses these technologies and algorithms just to arrange your ride. In reality, Uber is far more than a mere platform that connects thousands of users. Uber has changed our culture of transportation through the power of data.
This applies not only to ordinary cars, but also to luxury cars, food delivery, helicopters, and more. Uber understood from the start that the value was there in its data-sets, and that exploration was needed to derive it.

8.4.2. Airbnb

Airbnb is the most popular platform for finding accommodations or a rental property in cities across the world. Airbnb has distinguished itself and has turned into an empire spanning several countries. If you have already used this platform, you may have noticed that it offers locations that match your preferences and suggests the best price ranges according to your means.
To help you find what you are looking for and to respond to all of its users, Airbnb uses its data repository to identify what may be helpful to you, to others, and to the company as a whole. There is no question that Big Data analytics is behind this work. Just like Uber, Airbnb is a perfect example of the importance of data analysis. Several Big Data applications have been adopted to provide the best services to its users. Airbnb uses Big Data for:

– search processing and personalization: this is the basic principle of the Airbnb platform, which assists its users throughout their search. For each search, algorithms analyze data in real time and create offers to match each person's wishes;

– price prediction: to determine the value of an apartment or a house, Airbnb uses a pricing algorithm called Aerosolve. This algorithm takes into account many variables: the city, month, type of property, transportation, etc. In addition to these traditional variables, Aerosolve also analyzes images to determine the price (Sedkaoui 2018a);

– facilitating employee tasks: while the algorithm helps hosts set their prices, Airbnb also provides a platform for its employees to ask questions and make decisions. In recent years, many employees have used this platform, which contains both structured and unstructured data (Marr 2016): images, rental data, the number of rooms, various events, etc.;

– assessing users' opinions based on comments: as mentioned before, it is possible to analyze and evaluate users' (positive and negative) comments.

With these applications and many more, data doesn't just produce valuable information that helps guide Airbnb's decision-making process and operationalize its various activities. Beyond that, the company considers this data the voice of its customers. Thanks to the data-driven approach the company has adopted, its lens has widened to amplify these voices.
In other words, to determine which actions or decisions to take or not to take, to realize the potential of these voices, and to improve business performance.

8.4.3. BlaBlaCar

BlaBlaCar was developed based on a data-driven approach to better anticipate and meet the needs of its customers. By using Big Data and data analysis, BlaBlaCar tries to understand user behavior in order to expand its services. This has allowed the company to identify specific needs from the analysis of generally unstructured data.
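Understanding user behavior this way typically runs through controlled experiments that compare two variants of a feature. As a hedged illustration (this is our sketch, not BlaBlaCar's tooling, and the conversion numbers are invented), a two-proportion z-test decides whether an observed lift is statistically meaningful:

```python
from math import sqrt, erf

def ab_test(conv_a, n_a, conv_b, n_b):
    # Two-proportion z-test: does variant B convert better than variant A?
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p = (conv_a + conv_b) / (n_a + n_b)           # pooled conversion rate
    se = sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))  # standard error under H0
    z = (p_b - p_a) / se
    p_value = 1 - 0.5 * (1 + erf(z / sqrt(2)))    # one-sided p-value
    return z, p_value

# Hypothetical experiment: 1,200 of 20,000 users booked with layout A,
# 1,350 of 20,000 users booked with layout B.
z, p_value = ab_test(1200, 20000, 1350, 20000)
print(z > 0 and p_value < 0.05)  # True: B's lift is statistically significant
```

Only after such a test does a variant get rolled out to all users; otherwise random fluctuation could be mistaken for a real behavioral change.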
The case of BlaBlaCar in the Big Data arena illustrates the potential of analytics and what it can bring to the management of a carpooling service. In addition to A/B testing, its work with Big Data analyzes user behavior and optimizes its interface, the performance of its customer relationship management campaigns, and the ergonomics of its platform.

Since its creation in 2006, BlaBlaCar has gone from a small French start-up to a major player in European transport (Sedkaoui 2018b). BlaBlaCar's rapid growth relies in large part on the success of its marketing campaigns and its platform's ability to build customer loyalty. Far from being a coincidence, these two strengths rest on the judicious combination of a Hadoop cluster and a platform for analyzing large amounts of data. Using its data-driven approach, BlaBlaCar processes its customers' data and deploys new features to meet their needs. To interpret its data quickly and safely, the French carpooling unicorn collaborated with Vertica. The company also uses data visualization, integrating data from its users on social networks along with geographic data. In this context, BlaBlaCar built a back-office polygon mapping tool to simplify the management of its analytical platform.

8.4.4. Lyft

Nowadays, sharing economy companies are investing considerable amounts in Big Data projects. This is the case, for example, for Lyft, which is considered one of the most dynamic transportation companies – next to Uber, of course – in the United States. This company, which revolutionized the transport sector by putting drivers and users in touch through mobile applications, is available in over 200 cities and provides more than 15 million trips per month. Its applications generate large amounts of data that the company collects and analyzes: data about drivers and users, data on vehicles and their locations, data on speed and acceleration, etc. Lyft's data is diverse and voluminous.
Just like Uber, Lyft collects all of this data and analyzes it to monitor the functionality of the most frequently used services, to analyze usage patterns, to determine customer waiting times, etc. All of this must happen in real time. To do so, Lyft uses a variety of advanced tools and technologies that have evolved over the years (Yang 2018) to analyze large amounts of data. It has shifted from AWS products, used to manage its exponential growth, to Big Data technologies like Apache Hive, Presto, Flink, Kafka, etc. (Table 8.3). Lyft's objective is clear:
meet its clients' growing use by adopting an approach based on data analysis in order to solve scalability and concurrency issues.

Year – Adopted technologies
2015 – Auto Scaling, Amazon Redshift, Amazon Kinesis, MongoDB, DynamoDB, Amazon EC2 Container Registry
2016 – Hive, Spark, Airflow
2017 – Presto, Kafka, Flink
2018 – Druid, Superset

Table 8.3. The evolution of Lyft's data analysis pipeline
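The common pattern behind stream tools such as Kinesis, Kafka, and Flink is consuming an event stream while maintaining real-time aggregates, for example counting ride requests per zone over a recent time window. This is a deliberately simplified sketch of that pattern in plain Python, not Lyft's actual pipeline, and the events are invented:

```python
from collections import deque

class SlidingWindowCounter:
    """Count stream events per key over the last `window` seconds."""

    def __init__(self, window):
        self.window = window
        self.events = deque()  # (timestamp, key) pairs in arrival order

    def add(self, timestamp, key):
        self.events.append((timestamp, key))
        # Evict events that have fallen out of the time window.
        while self.events and self.events[0][0] <= timestamp - self.window:
            self.events.popleft()

    def count(self, key):
        return sum(1 for _, k in self.events if k == key)

# Hypothetical ride requests: (arrival time in seconds, pickup zone).
requests = SlidingWindowCounter(window=60)
for ts, zone in [(0, "airport"), (10, "downtown"), (30, "airport"), (65, "airport")]:
    requests.add(ts, zone)

print(requests.count("airport"))  # 2: the request at t=0 has aged out
```

Real stream processors add partitioning, fault tolerance, and exactly-once semantics on top, but the windowed-aggregation logic is the same.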
Lyft embraced AWS products and significantly expanded their use as they became available. The company uses products like Auto Scaling to manage up to eight times more riders during rush hours, and Amazon Redshift to obtain client information that feeds the company's Lyft Line service. Lyft also adopted Amazon Kinesis to channel production events through the system, and it exploits the scalability of Amazon DynamoDB for several types of data, including a route tracking system that stores the GPS coordinates of all trips. Lyft likewise relies on Amazon's EC2 Container Registry (ECR) to store photos and images from applications and to transfer them safely to data processing systems.

Lyft continually updates its data analysis process by integrating the most advanced technologies, such as Apache Hive, Kafka, Flink, and Druid. Ranging from Machine Learning to artificial intelligence, through advanced analysis of user data, analysis of data from connected objects, supply chain optimization, fraud detection, visualization, etc., the Big Data applications and the philosophy of analytics within Lyft are constantly growing.

8.4.5. Yelp

Created in 2004 by Jeremy Stoppelman, Yelp invites customers to write reviews and evaluate their experiences with companies. This allowed Yelp to collect more than 95 million reviews by the end of 2015. Since its inception, Yelp has grown
from a service offered in San Francisco to an international presence covering a large number of companies in over 30 countries. On Yelp.com, customers discuss their preferences and recommendations. The site also allows users to search for a business – a hotel, for example – to find its location, and even to read reviews from previous users. This allows not only that hotel but a large group of companies to get an idea of what customers think and therefore to better focus their strategies and future actions.

By hosting comments and reviews on businesses, Yelp gives companies the opportunity to improve their services and helps users choose the best available company. For example, thanks to information from the pages of its health services suppliers, some restaurants' pages now include inspection scores. By linking people to local businesses, Yelp provides detailed data on customer experiences with each company, through comments, advice, check-ins, and business attributes. These details provide an ideal scenario for companies because they allow them to reach their target audience, that is, people who are actively seeking their services nearby.

However, it is not possible for these companies to go through all the feedback from users manually and then make the important decisions needed to improve their service. This is where Big Data analytics comes in, by grouping the data into several categories of opinions. Using this data stream, companies can discover new opportunities and learn the characteristics of their target audiences. They can also learn what people like, where they live, and even what they are doing at specific times. The challenge for these companies can be summed up in one statement: determine how best to use this data to improve strategies and achieve specific objectives. Yelp therefore understood the potential of analytics and the power of data and advanced technology.
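Sorting reviews into categories of opinion can be illustrated at its simplest with a lexicon-based scorer. This is a toy sketch with an invented mini-lexicon; real sentiment systems use large curated lexicons or trained models:

```python
# Hypothetical mini-lexicon, invented for illustration.
POSITIVE = {"great", "friendly", "delicious", "clean"}
NEGATIVE = {"slow", "rude", "dirty", "noisy"}

def sentiment(review):
    # Score = positive word hits minus negative word hits.
    words = review.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

reviews = [
    "Great food and friendly staff",
    "Service was slow and the room was noisy",
]
print([sentiment(r) for r in reviews])  # ['positive', 'negative']
```

Even this crude rule already turns a pile of free text into a structure a business can aggregate: shares of positive versus negative reviews per location, per month, and so on.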
To better understand market trends and users’ needs, Yelp relied on the analysis of data generated daily by users (Glaeser et al. 2017). Yelp’s databases contain large amounts of users’ data and store information about them, such as their name, the number of reviews they have published, their usage time on Yelp, their friends, etc. The client reviews are collected on Yelp.com,
and methods such as text analytics and sentiment analysis are used to leverage this data. These data mining methods make it possible to find meaningful structure in customers' opinions when they rate and publish comments about a company.

8.4.6. Other cases

Based on the sharing economy concept, many companies are using Big Data analytics to match supply and demand, and advanced algorithms to set prices, forecast demand, etc.

8.4.6.1. TaskRabbit

TaskRabbit offers a marketplace that links those who need help – TaskPosters – to TaskRabbits who can meet those needs and have the skills to perform the requested task or tasks. Like other models in the sharing economy, TaskRabbit has a data warehouse containing system events that it analyzes in order to create models for managing related needs. When it comes to Big Data analytics, TaskRabbit employs the techniques required to:

– build a service roadmap for which a feature will be created. In this case, TaskRabbit compares user profiles to the data and discovers information that will be shared in the Tasker app;

– develop the service itself, using an algorithm that provides specific information (ratings, recommendations, etc.) to users by determining the order in which Taskers will be displayed: whether a Tasker can complete the task, is available at the given time, can work in the appropriate place, etc. This recommendation principle is very important, and it is similar to approaches adopted by Amazon, Yelp, Quora, etc.

TaskRabbit has created a model whose features optimize both the selection of a Tasker by the customer and the completion of the task by the Tasker. This model takes into consideration factors such as the Tasker's experience, the type of task, etc. Before moving on to another case, it should be noted that TaskRabbit developed another pricing service based on data analysis.
This service allows users (Taskers) to increase their chances of being hired on the platform by changing their prices at any time.
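The display-ordering idea described above amounts to scoring each candidate on a few features and sorting by the score. The weights, feature set, and profiles below are entirely hypothetical (TaskRabbit's real model is not public); this is only a sketch of the ranking pattern:

```python
# Hypothetical feature weights: rating matters most, then track record,
# then current availability.
WEIGHTS = {"rating": 0.5, "completed_tasks": 0.3, "is_available": 0.2}

def score(tasker):
    # Weighted sum of features, each normalized into [0, 1].
    return (WEIGHTS["rating"] * tasker["rating"] / 5.0
            + WEIGHTS["completed_tasks"] * min(tasker["completed_tasks"], 100) / 100.0
            + WEIGHTS["is_available"] * tasker["is_available"])

taskers = [
    {"name": "A", "rating": 4.9, "completed_tasks": 80, "is_available": 1},
    {"name": "B", "rating": 5.0, "completed_tasks": 20, "is_available": 1},
    {"name": "C", "rating": 4.5, "completed_tasks": 300, "is_available": 0},
]
ranked = sorted(taskers, key=score, reverse=True)
print([t["name"] for t in ranked])  # ['A', 'B', 'C']
```

In a learned ranking system the weights would be fitted from historical selections and completions rather than set by hand, but the serving-time logic – score, then sort – is the same.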
8.4.6.2. LaZooz

By installing this application, users can share data on their movements and trips with other users in the community. The application, which exploits data from its users, uses Machine Learning algorithms and predictive analytics to monitor the number of active users and to determine, for example, when a given region has reached a critical mass. Once the algorithms detect this situation, a carpooling service is made available to these users in real time and enabled in the application. The algorithms are regularly updated to adapt to the needs of the community. The LaZooz platform seeks to make the most of existing infrastructure while maintaining affordable rates for its services. LaZooz, which offers a carpooling service, also relies on a new version based on blockchain technology. As mentioned before, this technology can be used in the analysis of large volumes of data. The introduction of this technology, almost ignored by other sharing economy companies, has enabled the platform to synchronize the availability of seats in vehicles in real time.

8.4.6.3. Mobike and Ofo

Chinese sharing platforms such as Ofo and Mobike record bicycle trips through their applications. By using data analysis algorithms and adopting advanced technology, both models have participated in the modernization of the transport sector in China. The Chinese bike sharing company Mobike, backed by Tencent, has set up a stationless bike sharing system that allows customers to pick up or leave a bike at any legal parking area at any time. Its specially designed smart locks enable this sharing and are considered a new sharing model. Ofo, backed by Alibaba and Didi Chuxing, has developed fierce competition with Mobike. Ofo launched its urban traffic management platform, Qidian, covering 200 Chinese cities, and makes its large volume of data available to the government.
Furthermore, its Singularity platform shares 40 terabytes of data daily on the deployment of bicycles, locations, travel, etc. In taking this approach, Ofo has opened the door to a myriad of opportunities for the use and analysis of data. Given this situation, and to ensure user privacy and data security, Mobike also opened its data by cooperating with the government and scientific research institutes to promote intelligent management in this sector. We therefore find that the battle
for territory in China is moving from a competition for market share to a battle on the Big Data front.

8.4.6.4. Other models based on data analysis

While Uber uses its vast data stores to predict where and when a customer will want a taxi and to raise prices when demand is high, sites such as Kickstarter and GoFundMe are also key elements of the sharing economy. They allow individuals to turn their ideas into reality and to use data to ensure that their ideas have greater appeal and, therefore, receive more funding.

Recognizing the potential of the sharing economy coupled with the analysis of large amounts of data, some companies are taking steps to transform their business models (Botsman 2014). For example, DHL realized that its practice of dropping packages off at collection points rather than delivering them to customers' homes was frustrating customers. To alleviate this problem, it launched MyWays in 2013, which allows peers (people wishing to carry packages on demand) to pick up and deliver packages over the last kilometer to DHL customers, thanks to crowdsourcing. Similarly, realizing that it had unused meeting space, Marriott partnered with LiquidSpace, a marketplace that helps people find places to work. Other sharing economy models range from "brand as a service" – as in the case of Whole Foods, which partnered with Instacart, or BMW and Daimler, which respectively offer the DriveNow and Car2Go on-demand transport services – to "advertising partnerships" such as those between KLM and Airbnb or Lyft and MasterCard (Owyang 2015).

8.5. Conclusion

In this chapter, we took a tour of the various Big Data technologies and learned about the variety of Big Data applications used by companies in the sharing economy. These companies were able to follow the pioneers and take advantage of this phenomenon by adopting a data-driven approach and, of course, by using a variety of analytical techniques.
To succeed in the sharing economy context, one must add new skills to the toolbox: advanced data analysis technologies, Machine Learning, and the various data analysis algorithms that will be the subject of the last part of this book. This is also an opportunity for you to get to know this area, with enough theory to
understand the implications of the data analysis methods, but above all by using your computer and some software or programming languages.

TO REMEMBER. Throughout this chapter, we learned that:
– technologies like Hadoop, MapReduce, Spark, etc. facilitate the processing and storage of large amounts of data;
– the two categories of Machine Learning algorithms – supervised and unsupervised – are adapted to the different characteristics of Big Data and are effective for identifying patterns and extracting knowledge;
– various algorithms can be used to generate value from Big Data: regression, classification, and cluster analysis;
– companies like Uber, BlaBlaCar, Airbnb, and others analyze their data to generate value.
PART 3
The Sharing Economy? Not Without Big Data Algorithms
Sharing Economy and Big Data Analytics, First Edition. Soraya Sedkaoui and Mounia Khelfaoui. © ISTE Ltd 2020. Published by ISTE Ltd and John Wiley & Sons, Inc.
9 Linear Regression
Art, like morality, consists of drawing the line somewhere. Gilbert Keith Chesterton
9.1. Introduction

In the previous section, we discussed the existence of relationships or "correlations" between variables, and the task of discovering these correlations in order to measure the strength of the relationships among several variables. But in most cases, knowing that a relationship exists is not enough; to analyze that relationship properly, we have to use "linear regression".

Linear regression, which we will look at in this chapter, can be used to model the relationships between different variables. Broadly, this method aims to explain the influence of a set of variables on the values of another variable. The latter is called the "dependent variable" because it depends on the other (independent) variables. Regression is an explanatory method that makes it possible to identify the input variables that have the greatest statistical influence on the result.

It is a basic algorithm that any enthusiast of Machine Learning must master, and it provides a reliable foundation for learning the other data analysis algorithms. It is one of the most important and widely used data analysis algorithms, applied in domains ranging from forecasting the prices of homes, cars, or the weather, to understanding gene regulation networks, determining tumor types, etc.
In this chapter, you will learn about this algorithm, how it works, and its best practices. This chapter will explain how best to prepare your data when modeling using linear regression. Through case studies and using Python, you will learn how to make predictions from data by learning the relationships between the characteristics of your data and continuously valued observed responses.

This is an introduction to the method, intended to give you enough basic knowledge to use it effectively when modeling the other kinds of questions you are likely to encounter while working on Big Data projects. Let's examine how to define a regression problem and learn how to apply this algorithm as a predictive model using Python. Note that we are not going to go into detail about the code: for this, we recommend that you refer to specialized works on the subject and work through them.

9.2. Linear regression: an advanced analysis algorithm

Linear regression is a data analysis technique used to model the relationship between several variables (independent variables) and an outcome variable (dependent variable). A linear regression model is a probabilistic model that takes into account the randomness that may affect a particular result. Based on previously known input values, a linear regression model predicts a value for the dependent (outcome) variable. This method is mainly used to discover and predict cause and effect relationships between variables.

When you only have one independent variable (input variable), the modeling method is called "simple regression". However, when you have multiple variables that can explain the result (dependent variable), you must use "multiple regression". The choice between these methods depends mainly on the number of independent variables and, of course, on the relationship between the input variables and the outcome variable.
Both simple and multiple linear regression are used to model scenarios for companies or even governments. In the sharing economy context, regression can be used, for example, to: – model the price of a house or apartment depending on the square footage, number of rooms, location, etc.;
– establish or evaluate the current price of a service (rental rules, online assessment notes, etc.);
– predict the demand for goods and services in the market;
– understand the different directions and the strength of the relationships between the characteristics of goods and services;
– predict the characteristics of people interested in goods and services (age, nationality, occupation, etc.). Who takes a taxi, for example, or requests a particular service, and why?;
– analyze price determinants in various sharing economy application categories (transport, accommodations, etc.);
– analyze future trends (predicting whether the demand for a home will increase, etc.).

Overall, regression analysis is used when the objective is to make predictions about a dependent variable (Y) based on a number of independent variables (predictors) (X). Before diving into the details of this analysis technique, why should it be used and how should it be applied? Let's see what a regression problem looks like.

9.2.1. How are regression problems identified?

You do not really need to know the fundamentals of linear algebra in order to understand a regression problem. In fact, regression takes the overall relationship between dependent and independent variables and creates a continuous function generalizing these relationships (Watts et al. 2016). In a regression problem, the algorithm aims to predict a real output value. In other words, regression predicts the value on the basis of previous observations (inputs) (Sedkaoui 2018a).

For example, say you want to predict the price of a service you want to offer. You will first collect data on the characteristics of people who are looking for the same service (age, profession, address, etc.), the price of the services already offered by others on a platform, etc. Then you can use that data to build a model that allows you to predict the price of your service.
A regression problem therefore consists of developing a model using historical data to make a prediction about new data for which we have no explanation. The goal is to predict future values based on historical values.
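Such a prediction from historical observations can be sketched in a few lines with scikit-learn. The data below is invented purely for illustration (two hypothetical inputs, say customer age and number of prior bookings, against observed prices):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical historical observations: two input variables per row.
X = np.array([[25, 1], [32, 4], [41, 2], [29, 6], [50, 3]])
y = np.array([30.0, 45.0, 38.0, 52.0, 44.0])  # observed prices

model = LinearRegression()
model.fit(X, y)  # learn the relationship from past data

# Predict a value for new data we have not seen before.
new_customer = np.array([[35, 3]])
predicted_price = model.predict(new_customer)
print(predicted_price)
```

The fitted model generalizes the relationship between inputs and outputs, which is exactly what a regression problem asks for.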
It should be noted that, before starting the modeling phase, three elements in the construction of a regression model must be considered (Sedkaoui 2018b):
– description: the essential first step before designing a model is describing the phenomenon that will be modelled by determining the question to be answered (see the data analysis process described in Chapter 6);
– prediction: a model can be used to predict future behavior. It can be used, for example, to identify potential customers who may be interested in a given service (accommodations, etc.);
– decision-making: tools and forecasting models provide information that may be useful in decision making. Their activities and actions will lead to better results for the company through the "intelligence" provided by the model creation process.

The model therefore provides answers to questions anticipating future behavior and a phenomenon's previously unknown characteristics to identify specific profiles. How are models built?

9.2.2. The linear regression model

As its name suggests, the linear regression model assumes that there is a linear relationship between the independent variables (X) and the dependent variable (Y). This linear relationship can be expressed mathematically as:

y = β0 + β1x1 + β2x2 + … + βnxn + ε

where:
– y is the dependent (outcome) variable;
– xi are the independent (input) variables (for i = 1, 2, ..., n);
– β0 is the value of y when all xi are equal to zero;
– βi is the change in y based on a unit change in xi;
– ε is a random error term representing the fact that there are other variables not taken into account by this model.

The model is therefore a linear equation that combines a specific set of input values to find the solution that best matches the result of this set. The equation assigns a scaling factor to each input value, known as a parameter, which is represented by βi. Another factor is added to this equation, called the intercept parameter (β0). Any calculation or selection of β0 and βi allows us to obtain an expected result for each input xi.

This model is constructed in order to predict a response (Y) from (X). Constructing a linear regression model involves estimating the values of the parameters β0 and βi. For this, we can use several techniques, but ordinary least squares (OLS) is most commonly used to estimate these parameters.

To illustrate how OLS works, imagine that there is only one input variable (X) required to create a model that explains the outcome variable (Y). In this case, we collect combinations (xi, yi) that represent the observations obtained, and we analyze them to find the line or curve that is closest to reality or that best explains the relationship between X and Y. The OLS method allows you to find this line by minimizing the sum of squares of the difference between each point and this line (Figure 9.1). In other words, this technique makes it possible to estimate the values of β0 and β1 so that the following sum is minimized:

Σi (yi − (β0 + β1xi))²
We should also note that the "Gradient Descent" method is one of the most commonly used optimization techniques in Machine Learning. We will not go into the details of how this method works, but you will find an overview of "Gradient Descent" below.

When there are one or more input(s), you can use a process for optimizing the values of the parameters by iteratively reducing the error of the model on your data. This is called "Gradient Descent", and it begins with random values for each parameter. The sum of squared errors is calculated for each pair of input and output values. A learning rate is used as a scaling factor, and the parameters are updated in the direction that minimizes the error. The process is repeated until a minimum sum of squared errors is reached or until no further improvement is possible.

Box 9.1. Gradient Descent
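The process described in Box 9.1 can be illustrated with a short numpy sketch. The data is invented, and the learning rate and iteration count are arbitrary choices for the example:

```python
import numpy as np

# Invented data: a true line y = 3 + 2x plus random noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 100)
y = 3.0 + 2.0 * x + rng.normal(0, 1, 100)

beta0, beta1 = 0.0, 0.0   # initial parameter values
lr = 0.01                 # learning rate (the scaling factor)
for _ in range(2000):
    error = (beta0 + beta1 * x) - y       # prediction error per point
    beta0 -= lr * error.mean()            # step along the gradient for beta0
    beta1 -= lr * (error * x).mean()      # step along the gradient for beta1

print(round(beta0, 2), round(beta1, 2))   # close to the true 3.0 and 2.0
```

Each iteration moves the parameters a small step in the direction that reduces the mean squared error, which is exactly the iterative update Box 9.1 describes.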
It should be noted here that, when building a model, your goal is to find the model that best reflects reality; in other words, the one that best describes the studied phenomenon. To build such a model, you need to minimize the loss of information, which refers to the difference between your model, which is an approximation of the phenomenon, and reality. In other words, the smaller the gap, the closer you are to reality and the better your model.
But how can we reduce this gap? This is what we will discuss in the next section. Before turning to the application of the algorithm, we'll examine the concept of loss, which is a very important concept in model construction that you need to know to understand linear regression.

9.2.3. Minimizing modeling error

The whole reason for the existence of regression is the process of loss function optimization. This loss may be indicated by the so-called "error", which refers to the distance between the data and the prediction generated by the model, as shown in Figure 9.1.
Figure 9.1. Difference between model and reality. For a color version of this figure, see www.iste.co.uk/sedkaoui/economy.zip
Determining the best values for the parameters β0 and β1 will produce the line (the model) that most closely fits the observations. To do this, you need to convert this into a minimization problem in which you seek to minimize the error (loss) between the predicted value and the actual value. We have chosen to illustrate the loss graphically, because visualization is very useful for evaluating the distance between the observed values and the values calculated by the model. From this figure, we can see that it is the distance between the observation and the line that measures modelling error.

Basically, for each observation, you take the difference between the calculated value (ŷi) and the observed value (yi). Then you take the square and finally the sum of squares over all of your observations. Finally, you divide this value by the total number of observations. This short equation provides the Mean Squared Error (MSE) over all observed data points:
MSE = (1/n) Σi (yi − ŷi)²
MSE is the most commonly used regression loss function. It represents the sum of the squared distances between your target variable and the predicted values, and it is probably the most frequently used quantitative criterion for comparing calculated values and observed values. It should be noted that we can prove mathematically that maximizing the likelihood is equivalent to minimizing the loss function. The goal is to converge on the maximum of the likelihood function for the phenomenon under consideration, beginning from the initial observations. In this context, we can say that a large part of modelling lies in the optimization methods, that is to say, the methods that seek a maximum or minimum for a given function (Sedkaoui 2018a).

Although linear regression can be applied to a result that represents multiple values, the following discussion considers the case in which the dependent variable represents two values such as true/false or yes/no. In other cases, it may be necessary to deal with correlated variables. To address these problems, the next section of this chapter provides some techniques that we can use in the case of a categorical dependent variable (logistic regression) or when the independent variables are highly correlated.

9.3. Other regression methods

In the linear regression method, as we have previously explained, the outcome variable is a continuous numeric variable. But if the dependent variable is not numeric and is instead categorical (Yes/No, for example), how can we create a regression model? In such a case, logistic regression may be used to predict the probability of a result on the basis of a set of input variables. Likewise, in the case of multicollinearity, it may be prudent to impose restrictions on the magnitude of the estimated coefficients. The most popular approaches are the so-called Ridge and Lasso regression methods. In this section, we will explain these three methods. Let's begin with logistic regression.
9.3.1. Logistic regression

Logistic regression is a method for analyzing data-sets in which the dependent variable is measured using a dichotomous variable; in other words, there are only two possible results. This type of regression is used to predict a binary outcome (1/0, Yes/No, True/False) from a set of independent variables. To represent binary/categorical outcomes, we use nominal variables.

This type of regression can be used to determine whether a person will be interested in a particular service (an apartment on Airbnb, an Uber, a ride in a BlaBlaCar, etc.). The data-set comprises:
– the dependent (outcome) variable indicating whether the person has rented on Airbnb, requested an Uber, etc. during the previous 6 or 12 months, for example;
– the input (independent) variables, such as: age, gender, income, etc.

In this case, we can use the logistic regression model to determine, for example, whether a person will request an Uber in the next 6 to 12 months. The model provides the probability that a person will make a request during that period.

Logistic regression is considered a special case of linear regression in which the outcome variable is categorical. In simple terms, this method predicts the probability that an event will occur by fitting the data to a logistic function (logit). A logistic function f(x), also called a "sigmoid function", is an S-shaped curve that can take any real number and map it to a value between 0 and 1, as shown in Figure 9.2. Given that its range varies between 0 and 1, the logistic function f(x) is appropriate for modeling the probability of occurrence of a particular result. As the value of x increases, the probability of the result increases.

After explaining the logistic regression model, we will examine the other models used to address problems of multicollinearity.
Figure 9.2. The logistic function
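The binary-outcome scenario described above can be sketched with scikit-learn. The inputs (age and monthly income) and the 0/1 labels below are invented for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: did the person request a ride (1) or not (0),
# based on age and monthly income.
X = np.array([[22, 1500], [35, 3200], [48, 4000], [26, 1800],
              [31, 2600], [55, 5000], [19, 1200], [42, 3700]])
y = np.array([1, 1, 0, 1, 1, 0, 1, 0])

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)

# predict_proba returns the probabilities of both classes (0 and 1);
# the second column is the probability of the event occurring.
proba = clf.predict_proba([[30, 2500]])
print(proba[0, 1])
```

The probability returned is the logistic function applied to a linear combination of the inputs, which is exactly the mapping shown in Figure 9.2.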
9.3.2. Additional regression models: regularized regression

If several input variables are highly correlated, this condition is called "multicollinearity". Multicollinearity often leads to estimates of parameters of a relatively large absolute magnitude and an inappropriate direction. Where possible, most of these correlated variables should be removed from the model or replaced by a new variable. In this case, we can handle the regression problem by applying particular methods.

9.3.2.1. Ridge regression

This method, which applies a penalty based on the size of the coefficients, is a technique that can be applied to the problem of multicollinearity. When fitting a linear regression model, the goal is to find the values of the parameters that minimize the sum of squares. In Ridge regression, a penalty term proportional to the sum of the squares of the parameters is added to the residual sum of squares.

9.3.2.2. Lasso regression

Lasso regression (Least Absolute Shrinkage and Selection Operator) is a related modeling technique wherein the penalty is proportional to the sum of the absolute values of the parameters.
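A minimal scikit-learn comparison of the two penalties on invented collinear data (the alpha values are arbitrary choices for the example):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Two highly correlated inputs (multicollinearity): x2 is nearly a copy of x1.
rng = np.random.default_rng(1)
x1 = rng.normal(0, 1, 200)
x2 = x1 + rng.normal(0, 0.01, 200)
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(0, 0.5, 200)

ridge = Ridge(alpha=1.0).fit(X, y)   # penalty on squared coefficients
lasso = Lasso(alpha=0.1).fit(X, y)   # penalty on absolute coefficients

print(ridge.coef_)   # weight tends to be shared between the correlated inputs
print(lasso.coef_)   # one coefficient may be driven to (or near) zero
```

Ridge spreads the effect across the correlated variables, while Lasso can discard one of them entirely, which is the variable-selection behavior discussed below.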
The only difference from Ridge regression is that the regularization term is an absolute value. The Lasso method overcomes a disadvantage of Ridge regression by not only penalizing high values of the parameters β, but also setting them to zero if they are not relevant. Therefore, we can end up with fewer variables in the model, which is a huge advantage.

After discussing the basics of modelling, linear regression, logistic regression, and the Ridge and Lasso extensions, we now turn to action.

9.4. Building your first predictive model: a use case

Having previously discussed the principle of the regression method, let's now examine how we can build a model from data. For this, we will use a simple data-set to apply the algorithm to "real" data. But keep in mind that our goal goes beyond a simple explanation of how to build a model for predicting the behavior of Y based on X. Our intent is to illustrate the importance of algorithmic applications in the sharing economy context. Therefore, in this chapter, as in the other chapters of this final section, we will use databases from this economy's businesses, which we have compiled from specific sources. Since this chapter is focused on regression, we will model a phenomenon in the data we retrieve by applying the linear regression approach. The point here comes down to the execution of a simple model that, hopefully, will help you to better understand the application of this method using Python. Let's go!

9.4.1. What variables help set a rental price on Airbnb?

Suppose you want to participate in one of the sharing economy's activities and you have an apartment located in Paris, for example, that you plan to offer for rental to Airbnb customers. For this, you must set a price for your apartment. But how are you going to do this? Quite simply, by adopting a linear regression approach, which will allow you to build a model that explains the price based on certain variables.
You will try to predict the price of your product and define the variables that affect it. In other words, you will identify the variables that are important for setting rent, those that you will need to consider when preparing your apartment listing.
Before starting to create our model, we must clearly understand the goal, define it, and formulate some initial questions that may explain or predict the outcome variable (price). This will guide the processes of analysis and modelling. Identifying questions such as "what need are we trying to meet?" or "what variables should be considered?" will guide us to the best model and to the variables that we should consider using. In this context, we should therefore:
– prepare and clean the data (identify missing data, etc.);
– explain what we are preparing to present in order to define a way to measure the model's performance (by identifying potential correlations between different variables, for example);
– choose a model that includes the variables that are likely to improve the accuracy of the predictive model. In our case, we will choose linear regression.

So, the goal is clear: to model prices for Airbnb apartments in Paris. We want to build our own price suggestion model. But to create this model, we have to start somewhere. The database used in this specific case was compiled in December 2018 in CSV format (listings.csv), and includes 59,881 observations and 95 variables, accessible to all through the Airbnb1 online platform.

We will analyze this database and build our model – which will allow us to learn many things about your apartment – in three phases:
– the data preparation phase, which aims to clean the data and select the most significant variables for the model;
– the exploratory data analysis phase;
– the modeling phase using linear regression.

But before performing these three phases, we must transform our data-set into something that Python can understand. We will therefore import the pandas library, which makes it easy to load CSV files. We will use the following Python libraries and functions:
– NumPy;
– matplotlib.pyplot;
1 Available at: http://insideairbnb.com/get-the-data.html.
– collections;
– sklearn;
– xgboost;
– math;
– pprint;
– preprocessing;
– LinearRegression;
– mean_squared_error;
– r2_score.

Let's now move on to the first phase, that of data preparation.

9.4.1.1. Data preparation

The first thing to do in the data analysis process, as we saw in Chapter 6, is to prepare the material that we will use in order to produce good results. Data preparation therefore constitutes the first phase of our analysis. But before taking this step, let's look at our database for a moment (Figure 9.3).

In this first phase of analysis, we will clean the data and remove some variables. The goal is to keep those that are able to provide information to the modeling process. As the database file contains 95 variables, we removed those that seem useless (like: id, name, host_id, country, etc.) and many more columns containing information about the host, the neighborhood, etc. Obviously, this type of variable will not be useful for this analysis.

All of the retained variables form the linear predictor function, whose parameter estimation consists of formulating a modelling approach using linear regression. From this, we conducted an initial review of 39 independent variables, for which 59,881 observations would be used to model the expected value in euros for a night of accommodations in your apartment. The nature of these variables is both quantitative and qualitative. Quantitative variables contain, for example, the number of comments and certain characteristics such as latitude/longitude, etc. Qualitative variables include the neighborhood, room_type, etc.
Figure 9.3. Overview of the data-set
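Loading the file with pandas might look like the following. A tiny inline sample with invented columns stands in for listings.csv so that the snippet is self-contained:

```python
import pandas as pd
from io import StringIO

# In the chapter, the real file would be loaded with:
#   df = pd.read_csv("listings.csv")
# Here, a hypothetical two-row sample stands in for it.
sample = StringIO("id,name,price\n1,Flat A,$60.00\n2,Flat B,$110.00\n")
df = pd.read_csv(sample)

print(df.shape)           # (rows, columns) of the loaded data-set
print(list(df.columns))   # the variable names
```

With the real file, `df.shape` would report the 59,881 observations and 95 variables mentioned above.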
Figure 9.4. Variables with missing values
Figure 9.5. The percentage of empty values for all variables. For a color version of this figure, see www.iste.co.uk/sedkaoui/economy.zip
Now that we know our variables, we can proceed with data preparation. Using the describe function in Python, we can identify the variables with missing values, as shown in Figure 9.4.

We note first that all non-numeric variables are eliminated. Due to the nature of the regression equation, the independent variables (xi) must be continuous. Therefore, we must consider transforming categorical variables into continuous variables. What about the other categorical (qualitative) variables? You would no doubt like to find out which of the model's variables are empty. For this, we will use the percent_empty function, which allows you to see all the variables and to represent the results graphically (Figure 9.5).

From this figure, we can say that the square_feet variable contains almost no values (nearly 100% empty). Similarly, the license, security_deposit, and cleaning_fee variables contain missing values at a rate of more than 30%. Therefore, we will remove these variables. There are also other variables that contain missing values, namely: neighborhood, zipcode, and jurisdiction_names, but since the percentage is very small, we can incorporate them into the model.

For the remaining variables, we will eliminate the street variable, because we already have this information in zipcode, as well as the calendar_updated and calendar_last_scraped variables. We can also see that certain variables contain "NA" values; we won't use these in the next part of the analysis because they have no value for modeling.

This phase allows us to obtain descriptive statistics for the dependent variable (price) in our model, such as the mean, standard deviation (std), and the minimum and maximum values, as shown in Table 9.1.

Parameter             Value
Mean                  110.78
Standard deviation    230.73
Min                   0
Max                   25,000

Table 9.1. Descriptive statistics
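The descriptive statistics and the missing-value percentages can be reproduced with pandas. The mini data-set below is invented for illustration, and the percent_empty computation is recreated from `isna()`:

```python
import numpy as np
import pandas as pd

# Hypothetical mini data-set standing in for listings.csv.
df = pd.DataFrame({
    "price": [110.0, 60.0, 25000.0, 0.0, 95.0],
    "square_feet": [np.nan, np.nan, np.nan, np.nan, 40.0],
    "zipcode": ["75001", None, "75011", "75005", "75018"],
})

# Mean, std, min, max, quartiles for the dependent variable.
print(df["price"].describe())

# Percentage of missing values per variable (the role played by
# percent_empty in the chapter).
percent_empty = df.isna().mean() * 100
print(percent_empty.sort_values(ascending=False))
```

Variables with a high percentage of missing values (like square_feet in the real data) are then candidates for removal.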
These statistics, in addition to the graphics, of course, are very useful for understanding the data used. For example, we can say that the average price of apartments in Paris in all the observations is 110.78 euros per night. Another important point to mention here is that we have a very high maximum value (25,000) compared to the mean. Our work is therefore not finished, and we have to look into this. Is this a false value? If this is the case, how should we handle this problem? To answer these questions, we will use the sort_values function to pull data from the price variable in descending order.
Figure 9.6. Apartment prices in descending order. For a color version of this figure, see www.iste.co.uk/sedkaoui/economy.zip
The results in Figure 9.6 show that this is not an inconsistency, but simply a genuinely high apartment price. After preparing the data, we will move on to the next stage!

9.4.1.2. Exploratory analysis

After determining the explanatory variables that will be used in the model, it is important to consider whether each provides completely or partially overlapping information. This is defined in the regression process as
"multicollinearity", which exists when two or more independent variables are highly correlated. To detect this type of regression problem before proceeding with modelling, the correlations must be analyzed. In other words, to better understand the data and to identify interesting trends while building our model, we will examine variables that may be correlated with each other.

It should be noted here that we will mainly use the pandas library to browse the exploratory data analysis workflow and to manipulate our data, along with seaborn and matplotlib to draw the necessary graphics. For this, we used the corr function, and the results are presented in Figure 9.7.
Figure 9.7. Correlation matrix. For a color version of this figure, see www.iste.co.uk/sedkaoui/economy.zip
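The correlation matrix behind a figure like 9.7 comes straight from pandas; the heatmap itself can then be drawn with seaborn's heatmap(). The numeric columns below are invented to mimic the kind of relationships found in the listings data:

```python
import numpy as np
import pandas as pd

# Hypothetical numeric subset standing in for the listings data.
rng = np.random.default_rng(2)
df = pd.DataFrame({"accommodates": rng.integers(1, 8, 100).astype(float)})
df["beds"] = df["accommodates"] + rng.integers(-1, 2, 100)
df["availability_60"] = rng.integers(0, 60, 100).astype(float)
df["availability_90"] = df["availability_60"] + rng.integers(0, 30, 100)

corr = df.corr()          # pairwise Pearson correlation matrix
print(corr.round(2))      # accommodates/beds come out strongly positive
```

Pairs with coefficients close to 1 (or −1) flag candidate multicollinearity, which is how the redundant variables are identified below.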
Figure 9.7 outlines the Pearson correlation coefficients obtained from pairwise comparisons of the variables. A correlation coefficient indicates whether two variables tend to vary together. It describes the strength and direction of the relationship between variables. The most commonly used correlation coefficient is the Pearson correlation coefficient. This coefficient is used to analyze linear relationships. It is calculated using the following formula:

r = (1 / (n − 1)) Σi=1..n ((xi − x̄) / sx)((yi − ȳ) / sy)

where x̄ and ȳ are the sample means and sx and sy the sample standard deviations of X and Y.

Box 9.2. The Pearson correlation coefficient
In this figure, we want to show you how to create a visual graph (data visualization), which allows us to see the values more clearly. Two colors are used (red and blue) to help us better understand the correlations between the different variables. The color red represents a negative correlation, and the intensity of the shading depends on the exact value. The color blue, meanwhile, represents a positive correlation.

By analyzing the correlations, we found some moderately strong (positive or negative) relationships between the different features. The results of the correlation analysis, which vary from positive to negative correlation (see the measurement scale), reveal a clear positive correlation, for example:
– between availability_60 and availability_90 (0.94);
– between host_total_listings_count and calculated_host_listings_count (0.88);
– between accommodates and beds (0.86).

The other variables, however, are only moderately or weakly correlated with each other. So, we decided to delete some and keep others, like availability_90, because this variable is weakly correlated with the remaining variables. Likewise, we'll only keep the host_total_listings_count variable, etc. We'll also eliminate variables with NA results (empty lines or columns in Figure 9.7), namely requires_license, has_availability, and is_business_travel_ready, which will not be useful for our model.
The selection of explanatory (independent) variables is now complete. We finally have a data frame with 20 variables and 58,740 observations (98% of all observations), having deleted a total of 75 variables. Now we have to deal with the characteristics of our dependent variable, "apartment prices" (price). For this, we'll import the shuffle function, which randomly reorders the data-set. Let's go!

First, it is important to examine the distribution of our dependent variable (price). Figure 9.8 provides us with information about its overall distribution.
Figure 9.8. Price distribution. For a color version of this figure, see www.iste.co.uk/sedkaoui/economy.zip
We see that the prices in this data-set are highly skewed (skewness of 2.39). In Figure 9.8, we can also see that most of the data points are below 250 (A). We will therefore use a subset of our database in which the price varies from 50 to 250, removing the very high and very low prices. We'll also log-transform the target variable to reduce the skewness (B).

Now, let's examine the different variables that have an effect on prices. In Figure 9.9, you will find box plots for some of the variables in relation to price. They show consistent differences among variables such as room_type, bed_type, and property_type, which could affect the price of the house or apartment (Figure 9.9). We note that prices differ according to the condition of the room or the apartment in general: cleanliness, room type, and bed type are among the options that most raise the price of Airbnb apartments in Paris.
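The subsetting and log transformation described above can be sketched as follows (the price values are invented for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical price column; the chapter keeps prices between 50 and 250.
df = pd.DataFrame({"price": [10.0, 60.0, 110.0, 240.0, 800.0, 25000.0]})

# Remove the very high and very low prices.
subset = df[(df["price"] >= 50) & (df["price"] <= 250)].copy()

# The log transformation reduces the right skew of the target variable.
subset["log_price"] = np.log(subset["price"])

print(len(subset))              # extreme listings have been dropped
print(subset["log_price"].round(2).tolist())
```

Modeling log(price) rather than price itself makes the target's distribution closer to symmetric, which generally suits linear regression better.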
Now, let's look at other available features, such as bedrooms and bathrooms. The results show that there is a positive relationship between physical characteristics and prices. This seems reasonable, because the best equipped apartments are undoubtedly the most expensive (Gibbs et al. 2018).
Figure 9.9. Apartment prices as a function of certain characteristics. For a color version of this figure, see www.iste.co.uk/sedkaoui/economy.zip
Figure 9.10 shows the effect of the number of bedrooms and bathrooms on the price per night for Airbnb apartments in Paris. These two features are very important for setting the price, as the correlation coefficient between these two variables confirms (0.62).
Figure 9.10. Price per number of bedrooms/bathrooms. For a color version of this figure, see www.iste.co.uk/sedkaoui/economy.zip
But are bedrooms and bathrooms the only features of an apartment? Of course not! We should also take into account another very important feature that can also influence the price. These are amenities, namely an apartment’s conveniences.
Figure 9.11. Distribution of prices as a function of amenities. For a color version of this figure, see www.iste.co.uk/sedkaoui/economy.zip
Figure 9.12. The price/neighborhood relationship. For a color version of this figure, see www.iste.co.uk/sedkaoui/economy.zip
The database includes over 65 amenities, and Figure 9.11 shows the 20 most popular ones. Heating, basic equipment, kitchen equipment, and television sets are among the most common amenities that Airbnb apartments provide for clients.

The apartment's location is also an important factor that often has a significant impact on prices. To really see how important it is, we have plotted price as a function of neighborhood: Figure 9.12 shows that the location of the apartment matters and that it strongly influences the price.

Before finishing this step and moving on to the next one, we selected the numeric variables and explored them all together. We also transformed the categorical variables into a numeric format; this conversion puts them into a form that Machine Learning algorithms can read, which improves outcome prediction. Now that we have finished the exploratory data analysis phase, we can move to modeling.

9.4.1.3. Modeling

Exploratory analysis allowed us to conclude that the overall condition of the apartment affects the price. Other important features to consider are cleanliness, type of room and bed, square footage, etc. Some additional information, such as access to shops and public transportation, is very useful for setting the apartment price. In the end, we selected the 20 independent variables (Table 9.2) that are the most promising on the basis of the exploratory analysis we performed.

Dependent variable: price

Independent variables: host_total_listings_count, neighborhood, zipcode, property_type, latitude, room_type, longitude, accommodates, bathrooms, bedrooms, beds, bed_type, amenities, guests_included, extra_people, availability_365, maximum_nights, minimum_nights, instant_bookable, calculated_host_listings_count

Table 9.2. List of explanatory variables
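The conversion of categorical variables into numeric format mentioned above is typically done with one-hot encoding. A minimal pandas sketch (the column names and values here are illustrative, not the exact Airbnb schema):

```python
import pandas as pd

# Toy listings data; the real Airbnb columns follow the same pattern
df = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Entire home/apt"],
    "bedrooms": [2, 1, 3],
})

# One-hot encode the categorical column so ML algorithms can read it
encoded = pd.get_dummies(df, columns=["room_type"])
print(encoded.shape)  # (3, 3)
```

Each distinct category becomes its own 0/1 column, so no artificial ordering is imposed on the categories.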
In this step we also needed to manipulate our dependent variable (price), since its values are recovered in a currency format: the data contains the thousands separator (“,”) and the “$” symbol. For this, we performed the following operation:
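A cleaning step of this kind can be sketched in pandas as follows (the DataFrame name df is an assumption, and the sample values are invented):

```python
import pandas as pd

# Example values as they appear in the raw export: "$1,200.00", "$85.00", ...
df = pd.DataFrame({"price": ["$1,200.00", "$85.00", "$2,050.00"]})

# Strip the "$" symbol and the thousands separator, then cast to float
df["price"] = (
    df["price"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)
print(df["price"].tolist())  # [1200.0, 85.0, 2050.0]
```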
Then we moved on to the modeling step, which involves evaluating the models tested and selecting the one that gives the best results, making it possible to generalize to data not used during fitting. Since the objective of this case study is to build a model to predict the prices of Airbnb apartments in Paris, the type of model we are using is a linear regression model. So, we chose to perform a linear regression to account for the various dependencies that may be present in the data. To resolve the multicollinearity problem between some of the independent variables, we also used Lasso and Ridge regression. Before turning to the application of the models, it is essential to split the data into a training set and a test set, so as to create a set of untouched data for assessing each model's performance.
Given that:
– X_train: all predictors in the training set;
– Y_train: the target variable in the training set;
– X_test: all predictors in the test set;
– Y_test: the target variable in the test set; in our case, it's price.

In this context, we used the sets of variables that were pre-processed and developed in the previous step to perform these operations, trying different combinations of variables. Then we adjusted the parameters of each model that was developed, and we selected the model with the greatest precision.
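The split described above can be written with scikit-learn's train_test_split (a sketch: X and y stand for the predictor matrix and the price column; here they are random stand-ins, and the 80/20 ratio is illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Illustrative predictor matrix (100 listings, 5 features) and price target
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.normal(loc=100, scale=20, size=100)  # stand-in prices

# Keep 20% of the observations aside, untouched, to assess the model
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(X_train.shape, X_test.shape)  # (80, 5) (20, 5)
```

Fixing random_state makes the split reproducible, so every model is compared on the same held-out observations.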
This optimization process is the final step before reporting results. It makes it possible to find a parameter, or a set of parameters, that is robust and powerful for a given problem and algorithm. For the Ridge and Lasso regression models, we tested different alpha values (0.001, 0.01, 0.1, 1, 10, 100). The hyperparameters for each model and their validation scores are described in Table 9.3.

Model                Hyperparameter    RMSE
Linear regression    by default        2368.59
Ridge                alpha = 1.00      14.832
Lasso                alpha = 0.1       15.214

Table 9.3. RMSE for different regression models
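The alpha search described above can be sketched as a simple loop (shown here for Ridge; the same loop applies to Lasso, and the data are random stand-ins with deliberately correlated features):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Illustrative data: 200 listings with 5 strongly correlated features
rng = np.random.default_rng(1)
base = rng.normal(size=(200, 1))
X = np.hstack([base + 0.01 * rng.normal(size=(200, 1)) for _ in range(5)])
y = base.ravel() * 50 + 100 + rng.normal(scale=5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Try each candidate alpha and keep the one with the lowest test RMSE
best_alpha, best_rmse = None, float("inf")
for alpha in (0.001, 0.01, 0.1, 1, 10, 100):
    model = Ridge(alpha=alpha).fit(X_train, y_train)
    rmse = mean_squared_error(y_test, model.predict(X_test)) ** 0.5
    if rmse < best_rmse:
        best_alpha, best_rmse = alpha, rmse
print(best_alpha, round(best_rmse, 2))
```

In practice, scikit-learn's GridSearchCV automates this loop with cross-validation instead of a single hold-out split.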
We can see from the results in Table 9.3 that linear regression does not do a good job of predicting the expected price, because our set of variables contains many correlated data points. It overfitted the Airbnb apartments in Paris, which led to a very poor RMSE score.

The RMSE (Root Mean Squared Error) is the square root of the mean squared error (MSE) of the residuals. It shows the absolute degree to which the model fits the data, that is to say, the distance between the observed data points and the values predicted by the model. While R² is a relative measure of fit, RMSE is an absolute measure of fit. As the square root of a variance, RMSE can be interpreted as the standard deviation of the unexplained variance, and it has the useful property of being in the same units as the response variable. Lower RMSE values indicate a better fit. RMSE is a good measure of how accurately the model predicts the response, and it is the most important fit criterion when the main purpose of the model is prediction.

Box 9.3. RMSE (Root Mean Squared Error)
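RMSE as defined in Box 9.3 can be computed directly (the observed and predicted prices below are invented for illustration):

```python
import numpy as np

# Observed prices and model predictions (illustrative values)
y_true = np.array([100.0, 150.0, 80.0, 120.0])
y_pred = np.array([110.0, 140.0, 85.0, 115.0])

# RMSE: square root of the mean of the squared residuals
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(round(rmse, 3))  # 7.906
```

Because the squaring happens before averaging, a few large residuals pull RMSE up sharply, which is exactly why the overfitted linear model scores so badly in Table 9.3.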
Thus, we can infer that Ridge regression outperformed Lasso. Ridge regression seems to be best suited for our data, because there is not much difference between the scores of the (train) and (test) data-sets. We can conclude that the prices of Airbnb apartments in Paris are heavily dependent on available equipment and location. Indeed, most of the variables that have led to price increases are related to equipment and location.
We can apply other algorithms, such as random forest, decision trees, etc., to see if these characteristics (amenities/location) are the main factors for predicting the prices of Airbnb apartments in Paris. But we'll leave that to the next chapter, in which we'll study these algorithms more thoroughly.

9.5. Conclusion

The key to conducting good data analysis is still mastering the analysis process, from data preparation to the deployment of results, passing through exploration. Yes! It's not as easy as it seems, but following these steps is how the goal is achieved.

In this chapter, we used a real-world database to create a predictive model for the prices of Airbnb apartments in Paris. This real case study shows the potential of analytics and the power of data. Furthermore, the example allowed us to explore our original question concerning which variables influence the price. The creation of a predictive model for Airbnb prices using regression models provided us with information on the 59,881 observations we analyzed.

The analysis began with the exploration and examination of the data, to perform the transformations that contribute to a better understanding of the problem. This guided us in the selection of some important characteristics, by determining how these characteristics correlate with higher or lower prices.

No doubt many of you will now want to make the most of regression and learn how to apply its concepts in areas that interest you. What you really need, then, is to practice on other examples, such as Uber, BlaBlaCar, etc. This will not only allow you to use regression models to predict future trends that you would perhaps not have thought of before, but also to delve deeper into the analytics universe.
TO REMEMBER.– In this chapter, we learned that:
– linear regression is used to predict a quantitative response Y from predictive variable(s) X; it assumes a linear relationship between X and Y;
– during the creation of a linear regression model, we:
  - try to find the model's parameters;
  - try to minimize the error;
– the use of regression to model Airbnb data allowed us to predict future outcomes;
– we can use other regression models, such as logistic regression (for categorical variables) and Ridge and Lasso regression (for multicollinearity);
– although regression analysis is relatively simple to perform using existing software, it is important to be very careful when creating and interpreting a model.
10 Classification Algorithms
A classification is a definition comprising a system of definitions. Karl Wilhelm Friedrich Schlegel
10.1. Introduction

Linear regression, as we saw in the previous chapter, is used when we want to build a model that predicts a value. But what happens if we want to classify something? In this case, we can use classification algorithms.

These algorithms, which belong to the supervised learning group, appear in many applications related to data analysis, and they can be used on structured or unstructured data. They seek to assign class labels to new data-sets or observations based on existing observations. In other words, they are a set of techniques and methods that classify data by identifying the class or category to which new data belongs.

To better explain this, let's look at a simple example, one we have already mentioned in the previous section: the spam filter, which can learn to report spam using other examples of spam that contain unreliable terms such as “money transfer”, or that come from senders not in the recipient's contact list. Based on an email's contents, messaging providers use classification to determine whether incoming email messages are spam (Androutsopoulos et al. 2000; Sedkaoui 2018a). Classification algorithms will, depending on the data-set (also called the “training set”), mark spam in incoming emails. They classify emails based on experience, corresponding to the training data.
Sharing Economy and Big Data Analytics, First Edition. Soraya Sedkaoui and Mounia Khelfaoui. © ISTE Ltd 2020. Published by ISTE Ltd and John Wiley & Sons, Inc.
So unlike regression analysis, which seeks to predict a value based on previous observations, classification predicts the category to which new data-sets may belong. It is a set of techniques for determining and classifying a set of observations or data (Sedkaoui and Khelfaoui 2019). Classification techniques begin with a set of observations to determine how the attributes of these observations may contribute to the classification of future observations.

Classification techniques are widely used for prediction purposes. They are useful for customer segmentation, the categorization of audio and images, text analysis and customer sentiment analysis, etc. If you want to tackle such problems by applying supervised algorithms in your project, consider algorithms such as decision trees, k-Nearest Neighbors (kNN), Support Vector Machines (SVM), neural networks, Naïve Bayes, Random Forest, etc. It should also be noted that logistic regression, which we have previously discussed, is also one of the most common methods of classification.

In this chapter, we'll focus mainly on the different methods relevant to a use case from the sharing economy context. We will study a variety of classification algorithms that you can use to analyze large data-sets. By opening these algorithms' black boxes, this chapter will help you better understand how they work.

10.2. A tour of classification algorithms

Classification is a Machine Learning technique that uses known data to determine how to classify new data into a set of existing categories. This technique includes a set of algorithms that we can use to determine how new data should be classified under a set of labels, classes, or categories. In this part of the chapter, we will go through a selection of classification algorithms commonly used in data analysis.
This will allow you to learn the differences between classification algorithms and to develop an intuitive understanding of their applications.

10.2.1. Decision trees

Decision trees, also called “prediction trees”, use a tree to specify sequences of decisions and their consequences. Given the attributes and classes of observations, a decision tree produces a sequence of rules that can be used to classify data. This algorithm relies on a graphical model (a tree) to determine the final decision. Its goal is to explain a value from a set of variables.
This is the classic case of a matrix X with m observations and n variables, associated with a value Y to be explained. The decision tree thus builds classification or regression models in tree form. Classification models are generally applicable to categorical output variables, such as “yes” and “no”, “true” and “false”, etc., while regression models can be applied to discrete or continuous output variables, such as the expected price of a service (an apartment) or the probability that a subscription will be requested.

This algorithm can be applied in various situations, and its visual presentation helps to break a set of data down into subsets while incrementally developing the associated decision tree. Given its flexibility, and because its decision rules are fairly simple, this algorithm is frequently used in Big Data analysis.

10.2.1.1. The structure of the decision tree

Decision trees use a graph structure in which each potential decision creates a new node, which results in a graph similar to a tree (Quinlan 1986). Each node has a state, and connections are based on this state. The further we descend into the structure, or down the tree, the more these states are combined. In other words, a decision tree uses a tree structure to represent a number of possible decision paths and a result for each path. To better understand this structure, we'll use an example that will help us predict whether a proposed service platform will interest people.
Figure 10.1. Example of a decision tree
We can see in Figure 10.1 that the tree acts like a flowchart. A decision tree is a classification algorithm that uses tree-like data structures to model decisions and results. The structure of this algorithm is as follows:
– Nodes, or branching points, are often called “decision nodes”. They usually represent a conditional test; these nodes are the decision or test points. Each node refers to an input variable or an attribute. Each tree starts with a root node, which holds the most significant attribute for all the observations (gender, in our case).
– Branches represent the result of each decision and connect two nodes. Depending on the nature of the variable, each branch makes it possible to visualize the decision's state. If the variable is quantitative (numerical), we place the upper branch to the right, and we may include certain conditions (equal, less than, etc.).
– Leaf nodes indicate the final result (interested in this service or not). They represent class labels in classification problems, or a discrete value in regression problems.

Note that the depth of a tree is the minimum number of steps required to reach a result from the root. In our example, the “age” and “income” nodes are one step from the root, and the four nodes at the bottom are the result of all previous decisions. From the root to a leaf lies a series of decisions taken at various internal nodes.

10.2.1.2. How the algorithm works

The purpose of a decision tree is generally to build a tree T from a set of observations S. If all the observations in S belong to a class C, “gender: male” in our example, then the node is regarded as a leaf node and receives the label C. However, if some observations in S do not belong to class C, or if S is not pure enough, the algorithm selects the next most informative attribute (age, etc.) and splits S based on the values of this attribute. The algorithm then builds subtrees (T1, T2, etc.) for the subsets of S, recursively, until one of the stopping criteria is fulfilled.

The first step is to select the most informative attribute. A common way to identify this attribute is to use methods based on entropy (Quinlan 1993).
These methods select the most informative attribute based on:
– entropy: this shows the uncertainty or randomness of the elements; in other words, it is a measure of the impurity of an attribute. Entropy measures the homogeneity of a sample: if the sample is completely homogeneous, its entropy is zero, and if it is divided equally between the classes, its entropy is equal to one. Let D be a class, c ∈ D one of its labels, and P(c) the probability of c. The entropy of D is defined as follows:

H(D) = − Σ_{∀c ∈ D} P(c) log₂ P(c)
– information gain: this measures the relative change in entropy with respect to an attribute, and tries to estimate the information contained in each attribute. Information gain therefore measures the purity of an attribute. To build a decision tree, we must find the attribute that returns the highest information gain (that is to say, the most homogeneous branches). Information gain assigns a class for filtering at a given node of the tree, and classification is based on the information gain of each split. The information gain of an attribute A is defined as the difference between the base entropy and the conditional entropy of the attribute:

InfoGain(A) = H(S) − H(S/A)
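The entropy and information-gain criteria just described can be computed from scratch; a short sketch (not the book's code, and the labels and attribute values are invented):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = -sum over classes of p * log2(p)."""
    total = len(labels)
    return -sum(
        (n / total) * log2(n / total) for n in Counter(labels).values()
    )

def information_gain(labels, attribute_values):
    """Base entropy H(S) minus the entropy of S conditioned on the attribute."""
    total = len(labels)
    groups = {}
    for label, value in zip(labels, attribute_values):
        groups.setdefault(value, []).append(label)
    conditional = sum(
        (len(g) / total) * entropy(g) for g in groups.values()
    )
    return entropy(labels) - conditional

# A perfectly mixed sample has entropy 1; a pure one has entropy 0
print(entropy(["yes", "no", "yes", "no"]))   # 1.0
print(entropy(["yes", "yes", "yes", "yes"])) # -0.0 (i.e. zero)
```

At each node, the tree-building algorithm evaluates information_gain for every candidate attribute and splits on the one with the highest value.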
10.2.2. Naïve Bayes

The Naïve Bayes algorithm is a set of classification algorithms based on Bayes' theorem. This is not a single algorithm but a family of algorithms that share a common principle, namely that each pair of features is treated independently of the others. To help you understand the principle of this algorithm, we will look at its different applications.

10.2.2.1. The applications of the algorithm

Naïve Bayes is a fairly intuitive classification algorithm to understand. A classification using this algorithm assumes that the presence or absence of a particular characteristic of a class is not related to the presence or absence of other features. For example, an object can be classified according to attributes such as shape, color, and weight. The algorithm can be used for binary and multiclass classification problems. The main point is the idea of treating each feature independently: the Naïve Bayes method assesses the probability of each feature independently, irrespective of any correlation, and determines the prediction based on Bayes' theorem (Sedkaoui 2018b).

Using this method has the following advantages: simplicity and ease of understanding. Moreover, this algorithm generally performs well in terms of resources consumed, since only the probabilities of the characteristics and classes must be calculated; it is not necessary to find coefficients as in other algorithms. Because Naïve Bayes is easy to implement and can be run efficiently, even without prior knowledge of the data, it is one of the most popular algorithms for the classification of textual data.
Besides its simplicity, this algorithm is also a good choice when CPU and memory resources are limiting factors. This classification algorithm can be used in real-world applications such as:
– sentiment analysis and text classification;
– filtering unwanted messages (spam);
– recommendation systems;
– facial recognition;
– fraud detection;
– real-time predictions;
– predicting the probability of several classes for the target variable.

Its main drawback is that each feature is treated independently, even though in most cases this assumption cannot be accurate (Bishop 2006).

10.2.2.2. Operation of the Naïve Bayes algorithm

Naïve Bayes is a classification technique based on Bayes' theorem that assumes independence between the predictors. That is to say, it assumes that the presence of an entity in a class is not linked to any other entity. Even when features depend on each other or on the existence of other features, all of these properties are treated as independent.

Bayes' theorem reveals the relationship between two events and their conditional probabilities. Bayes' law is named after the English mathematician Thomas Bayes. Mathematically, the theorem describes the relationship between the probabilities of A and B, P(A) and P(B), and the conditional probabilities of A given B and of B given A, namely P(A/B) and P(B/A). The conditional probability that event A will occur, given that event B has already occurred, is written P(A/B) and can be calculated using the following formula:

P(A/B) = P(B/A) P(A) / P(B)

Box 10.1. Bayes' theorem
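A quick numeric illustration of the theorem, using the spam example (all probabilities below are invented for illustration): suppose 20% of emails are spam, 60% of spam contains the phrase “money transfer”, and 5% of non-spam contains it.

```python
# Hypothetical probabilities (illustrative, not measured from real data)
p_spam = 0.20               # P(A): prior probability of spam
p_phrase_given_spam = 0.60  # P(B/A)
p_phrase_given_ham = 0.05   # P(B/not A)

# Total probability of seeing the phrase: P(B)
p_phrase = p_phrase_given_spam * p_spam + p_phrase_given_ham * (1 - p_spam)

# Bayes' theorem: P(A/B) = P(B/A) * P(A) / P(B)
p_spam_given_phrase = p_phrase_given_spam * p_spam / p_phrase
print(round(p_spam_given_phrase, 2))  # 0.75
```

So seeing the phrase raises the probability of spam from the 20% prior to 75%: this update from prior to posterior is exactly what a Naïve Bayes classifier does for every feature.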
This theorem can be extended to become a Naïve Bayes classifier. First, we can use the conditional independence hypothesis, which states that each attribute is conditionally independent of any other attribute given the label of class Ci:
P(x₁, x₂, …, xₙ / Cᵢ) = P(x₁ / Cᵢ) P(x₂ / Cᵢ) ⋯ P(xₙ / Cᵢ)

Therefore, this assumption simplifies the calculation of P(x₁, x₂, …, xₙ / Cᵢ).
Then, we can ignore the denominator P(x₁, x₂, …, xₙ), since it appears in P(Cᵢ / x₁, …, xₙ) for every class (for i = 1, 2, …); removing the denominator has no impact on the relative probability scores and simplifies the calculations. Here, the xⱼ represent the attributes or characteristics, and Cᵢ is the response variable. Now, P(x₁, …, xₙ / Cᵢ) becomes equal to the product of the probability distributions of the individual attributes.

We are interested in finding the probability P(Cᵢ / x₁, …, xₙ). For the several possible values of Cᵢ, we calculate this probability for each of them. So how do we predict the class of the response variable from the different values that we obtain for P(Cᵢ / x₁, …, xₙ)? We simply take the most likely, that is, the maximum, of these values. This procedure is therefore known as “maximum a posteriori estimation”.

10.2.3. Support Vector Machine (SVM)

Unlike regression, which can be regarded as a sword capable of slicing and dicing data efficiently but unable to process highly complex data, SVM is like a sharp knife: it works on smaller data-sets, but can be much more powerful for building models from them.

We assume that you are already accustomed to the linear regression and logistic regression algorithms. If not, we suggest that you reread Chapter 9 before moving on to the Support Vector Machine (SVM) algorithm. A Support Vector Machine is another simple algorithm that produces accurate results with less computing power (Cristianini and Shawe-Taylor 2000). SVM can be used for both regression and classification tasks, but it is especially widely used in classification.

10.2.3.1. Definition of SVM

A Support Vector Machine is an algorithm that seeks the separating hyperplane with the widest possible gap, to improve resistance to noise. Like logistic regression, it is a discriminative method that focuses only on predictions.
This is a classification method in which we plot each data point in an n-dimensional space (where n is the number of features), with the value of each feature being the value of a particular coordinate. The classification is then done by finding the hyperplane that clearly differentiates the two classes. For example, if we have only two variables, such as renting an apartment or a villa on Airbnb, we first plot the two variables in a two-dimensional space (n = 2), with each point having two coordinates; the boundary points closest to the separation are called support vectors (see Figure 10.2 for more detail).
Figure 10.2. Coordinates of entities divided into two classes. For a color version of this figure, see www.iste.co.uk/sedkaoui/economy.zip
Now let's look for a line that divides the data into the two groups of differently classified points, such that the nearest points of each group are as far away from the line as possible. In SVM, the goal of optimization is thus to maximize “the gap”: this gap represents the distance between the separating hyperplane and the points nearest to it, called support vectors. The question that now arises is: “How do we identify the right hyperplane?”
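scikit-learn's SVC answers this question by maximizing the margin; a minimal linear-kernel sketch (toy coordinates, with a hypothetical apartment/villa labeling):

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable groups (say, "apartment" = 0 and "villa" = 1)
X = np.array([[1, 1], [1, 2], [2, 1], [6, 6], [6, 7], [7, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

# A linear kernel finds the separating hyperplane with the widest gap
clf = SVC(kernel="linear").fit(X, y)

# The support vectors are the boundary points that define the margin
print(clf.predict([[2, 2], [6, 5]]))  # [0 1]
print(len(clf.support_vectors_))
```

Replacing kernel="linear" with kernel="rbf" is how the same estimator handles the nonlinear cases mentioned below.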
Figure 10.3. How SVM works. For a color version of this figure, see www.iste.co.uk/sedkaoui/economy.zip
As shown in Figure 10.3, there are several ways to separate the data. To resolve this issue, SVM seeks the best separating line by maximizing the orthogonal distance between the nearest points of each data category and the line itself. These distances are called “gaps”, and the points used to define the gap (the nearest boundary points) are called “support vectors”. Once the ideal function is determined, the algorithm is able to predict whether people prefer to rent an apartment or a villa. This example illustrates the application of the algorithm and shows a linear hyperplane between the two classes or groups. SVM can effectively perform linear classifications, and it can also be configured to perform nonlinear classifications (Cristianini and Shawe-Taylor 2000).

10.2.4. Other classification algorithms

In addition to the three classification algorithms presented above, several other algorithms are used to analyze large amounts of various kinds of data. In particular, these include k-Nearest Neighbors (kNN), Random Forest and neural networks. These algorithms are commonly used methods for obtaining the best models.

10.2.4.1. The k-nearest neighbors (kNN)

kNN, or k-Nearest Neighbors, is an algorithm that can be used both for classification and regression. The principle of this model is to choose the closest
data points to the object under study in order to predict its value (Sedkaoui 2018b). In classification or regression, the input consists of the closest training examples in a space. To understand how this algorithm works, let's consider a small visual example. In Figure 10.4, we have two classes of data, red circles and black squares. Note that the input is two-dimensional and the target to be classified is the shape.
Figure 10.4. How does kNN work? For a color version of this figure, see www.iste.co.uk/sedkaoui/economy.zip
Suppose now that we have a new input object whose class we want to predict. How should we proceed? Well, let's consider the nearest neighbors of this object and see which class contains the majority of these points, in order to deduce its class. In this example, we used 5-NN (k = 5), and we can say that the new entry belongs to the class of red circles, as its nearest neighbors contain three (3) red circles and two (2) black squares.

The principle of this algorithm is to assign a data point to one of the class categories by calculating the distance separating it from each point in the data-set. We take the first k elements in the ordered series of distances and choose the dominant label among these elements, which gives the category of the new element.

10.2.4.2. Random Forest

Random Forest is one of the most popular supervised learning algorithms. It requires virtually no preparation or data modeling and usually provides accurate results (Sedkaoui 2018a). Random forests are based on the decision trees described above; more specifically, they are collections of decision trees (Breiman 2001) that together produce a better forecast. This is why the algorithm is called a “forest” (a set of decision trees). So, as the name suggests, this algorithm creates a forest with a degree of randomness. The forest constitutes a set of decision trees, generally trained with the bagging method (Breiman 1996).
The general idea of the bagging method is that a combination of learning models improves the overall result. Bagging uses the bootstrap technique, which repeatedly samples a data-set, with replacement, based on a uniform probability distribution. “With replacement” means that when a sample is selected, it is kept in the original data-set and can be selected again. When sampling is done with replacement, some samples may appear multiple times in a set of data, while others may be absent. A model, or base classifier, is trained separately on each bootstrap sample, and a test sample is assigned to the class that receives the largest number of votes.

Box 10.2. Bagging or bootstrap aggregation
Random Forest develops several trees into one model. To classify a new object or a new entry in terms of its attributes, each tree provides a classification, and we say that the tree “votes” for this class. The forest chooses the classification with the most votes among all its trees, or averages the outputs of the different trees. This algorithm therefore builds several trees and combines them to get a more accurate result.

10.2.4.3. Neural networks

Neural networks are inspired by the neurons in the human nervous system. They help find complex patterns in the data. These networks learn a specific task as a function of the data-set (Sedkaoui 2018b). Thanks to recently developed Deep Learning algorithms, neural networks have become a method for tackling a number of complex tasks, such as image classification, voice recognition, etc.

The term Deep Learning refers to juxtaposed neural networks (the number of layers) and therefore to the depth of the network. It draws, among other things, on the latest advances in neuroscience and on the communication patterns of our nervous system. Some also associate it with modeling that offers a higher level of data abstraction to provide better forecasts. In these networks, the learning phase aims to converge the network's parameters toward an optimal classification. They require a lot of training data and are not suitable for all problems, especially if the number of input parameters is too low. Deep Learning is particularly effective for processing images, sound, and video. It is found in sectors including health, robotics, computer vision, etc.

Box 10.3. Deep Learning
In neural networks, an input layer receives the input data and propagates it to the hidden layers; finally, an output layer produces the classification result. Each tier of the network is connected to the adjacent tiers by a set of interconnections between their nodes. So, neural networks consist of tiers of interconnected nodes (Figure 10.5).
Figure 10.5. Example neural network
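A small network like the one in Figure 10.5 can be instantiated with scikit-learn's MLPClassifier (a sketch: the layer size, data, and parameters are invented for illustration):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

# Toy two-class data: an input layer of 2 features, one well-separated
# cluster per class
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(3, 0.5, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

# One hidden layer with 8 nodes between the input and output layers
clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```

Adding more entries to hidden_layer_sizes deepens the network, which is the "depth" that Box 10.3 refers to.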
So, we've completed our tour of the different classification algorithms, and we found that each algorithm has its own properties. To help you identify the most suitable method for a specific problem, we suggest, in Table 10.1, a list of questions to consider. Depending on the set of training data and the variables, we will opt for one of these algorithms; the goal is the same: to predict or classify new entries, for example, to determine whether a new email is spam or not.

Now let's go beyond the theory behind these classification algorithms and use an example from the sharing economy to explain how these methods work in practice, while continuing to explore how Big Data analytics can be used to generate value for companies in this economy.
Question → Suggested algorithm
– Should we include the class of the classification? How can the variables affect the models? → Logistic regression
– Are some of the input variables correlated? → Decision tree
– Does the data contain several types of variables? Are certain input variables relevant? → Logistic regression
– Does the data contain categorical variables with many levels? → Naïve Bayes
– Is the data high-dimensional? → Naïve Bayes
– Is there nonlinear data, or are there discontinuities in the input variables that can influence the outcome? → Decision tree
– Is it an interoperability problem? → Random forests
– Is the goal to predict the probability of a particular event? → Logistic regression
– Can the linear model be extended to nonlinear problems? → SVM
– How can we make predictions without a training model, but with a more expensive prediction step in terms of processing? → kNN
– Are we looking for complex patterns in the data? → Neural networks

Table 10.1. How do we choose the classification algorithm?
10.3. Modeling Airbnb prices with classification algorithms

Are some factors more important than others in generating higher rates? To answer this question, we'll use what we already have to model and predict the price of Airbnb apartments in Paris from the set of observed data. In this section, therefore, we'll reexamine the example from Chapter 9 to better show how the different classification algorithms can be used. But first, we think it's important to review the various steps of the process we have adopted.
10.3.1. The work that's already been done: overview

We started with data preparation and processing, to make the data more consistent and to eliminate missing and non-significant values (NaN). Next, we performed an exploratory analysis to identify the most significant variables, those that best explain the model. This analysis allowed us to discover the importance of some attributes, such as bedrooms, neighborhood, room_type, etc., and how they can influence prices. After analyzing all the results, 20 variables were determined to be the most significant for the model. These variables will serve in this chapter as a reference for comparing the results of the different models.

These steps represent the key that helped us exploit 58,740 observations. Using different techniques, the exploratory analysis allowed us to advance step by step and understand the most important variables before running the regression models (linear, Ridge and Lasso). The results of the regression technique applied in Chapter 9 argue in favor of the assumption that the prices of apartments are higher if they are close to the center of Paris and if they have the necessary equipment (beds, bathrooms, etc.), in addition, of course, to their condition (type of housing, property, etc.).

A linear model was created by regressing the price on the 20 variables, which were significant at the 0.001 level. The RMSE, measured at 2368.59, was far higher than the ideal: this means that the price predicted for a rental can deviate by 2,368.59 euros from the real rental price. Linear regression therefore produced the worst performance. Two other models, Ridge and Lasso, were used to address the problem of highly correlated independent variables. Both models had relatively similar test scores, but the performance of Ridge was better than Lasso, and the model didn't overfit too much with respect to the training set. Ridge regression thus allows for better price predictions for Airbnb apartments.
Now, we will also try other algorithms that make it possible to better predict prices. We will use Random Forest and a decision tree to determine the characteristics that are more important for determining the prices of Airbnb apartments in the French capital.
Classification Algorithms
10.3.2. Models based on trees: decision tree versus Random Forest

Since our database is ready, having been prepared and analyzed in Chapter 9 of this book, we can begin the modeling phase using the algorithms discussed in this chapter. The objective is to continue seeking the model that best represents our data and enables prediction of the target value: the price of Airbnb apartments in Paris. As you have seen in the previous sections of this chapter, there are several algorithms we could take advantage of to build the model, but we are going to use two of them: the decision tree and Random Forest.

10.3.2.1. Decision trees

As we indicated at the beginning of this chapter, a decision tree is a type of classification algorithm with a predefined target variable. This algorithm is mainly used for classification but can also be applied to prediction problems with both categorical and numerical variables. Tree models can capture non-linear relationships in sets of observations. Using a tree structure, a decision tree comprises a number of internal nodes, each of which corresponds to a test on a characteristic (e.g. whether the number of rooms is less than or greater than 2), and leaf nodes, each of which carries a class label.

We will now fit the decision tree algorithm to our listings.csv data. But before doing this in Python, we need to import certain required libraries:
– NumPy and pandas: for manipulating data;
– train_test_split (found in sklearn.model_selection in current versions of scikit-learn): for splitting the data into a training set (x_train, y_train) and a testing set (x_test, y_test);
– DecisionTreeClassifier: for creating a decision tree classifier;
– accuracy_score: for calculating precision metrics from the predicted class labels.

To understand the performance of the model, we divided the data-set into a training set and a test set using the train_test_split function.
The test_size parameter has a value of 0.3; this means that the test set will account for 30% of all data, and the training set will represent the remaining 70%. The random_state parameter seeds the pseudo-random number generator used for the random sampling.
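As a minimal sketch of this split, with a tiny hypothetical data frame standing in for listings.csv (in current versions of scikit-learn, train_test_split lives in sklearn.model_selection):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the prepared listings data
df = pd.DataFrame({
    "accommodates": [2, 4, 1, 6, 3, 2, 5, 2, 4, 3],
    "bedrooms":     [1, 2, 1, 3, 1, 1, 2, 1, 2, 1],
    "price":        [80, 150, 60, 300, 95, 75, 210, 85, 160, 110],
})
X = df[["accommodates", "bedrooms"]]
y = df["price"]

# 70% train / 30% test; random_state makes the sampling reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(len(X_train), len(X_test))  # 7 3
```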
The decision tree algorithm used for classification problems (like the one we are currently examining) aims to create one of these decision trees in order to assign records to a defined number of categories.
Figure 10.6. The 10 most important features (decision tree). For a color version of this figure, see www.iste.co.uk/sedkaoui/economy.zip
Having trained the decision tree model to classify the features according to their importance, we are now able to determine which features have greater weight in the calculation of prices. The results of this classification are illustrated in Figure 10.6.
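A sketch of this step on synthetic stand-in data (all column values and coefficients are hypothetical; since the target, price, is numeric, we use scikit-learn's DecisionTreeRegressor, the regression counterpart of the classifier named above):

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Toy stand-in for the prepared listings data
rng = np.random.RandomState(0)
X = pd.DataFrame({
    "longitude": rng.uniform(2.25, 2.42, 500),
    "latitude":  rng.uniform(48.81, 48.90, 500),
    "bedrooms":  rng.randint(0, 4, 500),
    "bathrooms": rng.randint(1, 3, 500),
})
y = 50 + 60 * X["bedrooms"] + 30 * X["bathrooms"] + rng.normal(0, 10, 500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X_train, y_train)

# Rank the features by their weight in the splits, as in Figure 10.6
importances = pd.Series(tree.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))

# RMSE on the held-out test set, the metric used to compare all the models
rmse = np.sqrt(mean_squared_error(y_test, tree.predict(X_test)))
print(round(rmse, 3))
```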
Drawing a conclusion from this model, we can see that the algorithm has identified amenities and location as the main contributors to the way in which it predicted prices. We found that the most important features were:
– longitude;
– host_total_listings_count;
– latitude;
– accommodates;
– bedrooms;
– bathrooms.

Other features (Élysée, etc.) concern the apartment’s location.

The question you may now be asking is: how do we select the criteria and values for classification? It is the intelligence of the algorithm that does it for you. The principle is simple: start with the most discriminating variables, that is to say those that best separate the observations into homogeneous groups, a principle that is mathematically modeled by the entropy function (Sedkaoui 2018b).

Using the data-set, we calculated the RMSE, which is equal to 21.505. This value is very high compared to the RMSE values for the Ridge (14.832) and Lasso (15.214) regressions. The features listed above gain more weight because they tend to be more linearly correlated with the price; those with higher correlation coefficients tended to increase apartment prices.

As can be seen in the resulting graph, the decision tree captures the general trend of the data. However, one disadvantage of this model is that it does not account for the continuity and differentiability of the desired prediction. In addition, we must select an appropriate value for the tree depth to avoid over- or underfitting the data. We will now use the Random Forest algorithm to process the data.

10.3.2.2. Modeling using Random Forest

We chose the Random Forest model because it uses an ensemble technique that combines several models into one. This method usually reduces the variance and bias in observational data, which improves forecast accuracy.
The advantage of the Random Forest algorithm is independence between the trees, which makes it possible to parallelize processing and improve the algorithm’s performance. Every tree is different, and their grouping should make it possible to achieve a lower degree of variance than a single tree.
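A sketch of this step, again on hypothetical stand-in data, using scikit-learn's RandomForestRegressor (since the trees are grown independently, n_jobs=-1 parallelizes the training):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic stand-in for the listings features (all values hypothetical)
rng = np.random.RandomState(1)
X = pd.DataFrame({
    "bathrooms":                 rng.randint(1, 3, 300),
    "accommodates":              rng.randint(1, 7, 300),
    "bedrooms":                  rng.randint(0, 4, 300),
    "host_total_listings_count": rng.randint(1, 20, 300),
})
y = 40 * X["bedrooms"] + 25 * X["bathrooms"] + 10 * X["accommodates"] \
    + rng.normal(0, 5, 300)

# 100 independent trees, trained in parallel, averaged for the prediction
forest = RandomForestRegressor(
    n_estimators=100, n_jobs=-1, random_state=0
).fit(X, y)

# Feature ranking analogous to Figure 10.7
importances = pd.Series(forest.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```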
Figure 10.7. The 10 most important features (Random Forest). For a color version of this figure, see www.iste.co.uk/sedkaoui/economy.zip
With regard to our example, we first used the data-set to train the model by applying the Random Forest algorithm. We then used the test data-set to test this model and proceed to validation. This allowed us to determine the most important features, which include two factors (equipment/location). Among the features shown in Figure 10.7, we can identify a set of important variables, which include:
– bathrooms;
– accommodates;
– host_total_listings_count;
– bedrooms.

These are almost the same variables as the significant variables in our decision tree model, which shows that they are key predictors for forecasting the prices of Airbnb apartments in Paris. We obtained an RMSE equal to 14.775 (Table 10.2), which is lower than the RMSE values of the other models.

Category                 Model           MSE            RMSE
Models based on trees    Decision tree   462.465        21.505
                         Random Forest   218.300        14.775
Baseline models          Median          243.079        15.591
                         Mean            243.079        15.591
Regression models        Ridge           219.988        14.832
                         Linear          5610218.588    2368.59
                         Lasso           231.465        15.214

Table 10.2. MSE and RMSE for each model
Figure 10.8. RMSE: score for the different models. For a color version of this figure, see www.iste.co.uk/sedkaoui/economy.zip
Note that since the price distribution is slightly skewed to the right (see Figure 9.8 in Chapter 9), we added two baseline models that account for this skew, using the mean value of 110.78 euros and the median value of 80 euros as constant predictions.

Comparing the RMSE values, we see that the linear regression model showed the highest RMSE (2368.59) and that the Random Forest model has the smallest (14.775). The Random Forest model surpasses the other models in terms of predictive accuracy. Based on the RMSE test for each model, we therefore selected Random Forest as the best model for predicting the price, because it has the smallest RMSE. This model is effective for predicting Airbnb prices because it makes it easy to identify important predictors, and the algorithm works well with large numbers of predictors and large sets of training data. The downside, however, is that prediction takes too long when there is a large number of trees.

We also concluded that the most important variables for predicting prices are those related to location and amenities (bedrooms, bathrooms, location, etc.). The total number of guests as well as hospitality play a role, but all the models have shown the significance of the list of amenities, the home type, and the location. Based on these findings, we now want to predict the price using a different algorithm, kNN: as a host, you would surely compare apartments similar to yours when setting a price, and the k-nearest neighbors (kNN) algorithm automates exactly this kind of comparison.

10.3.3. Price prediction with kNN

In this section we will continue our efforts to learn what lies within our large data-set. Using the results of the previous algorithms as a basis, we will analyze certain characteristics related to location and equipment that seem to be very important for determining prices.

We now look for other alternatives using the kNN algorithm. First, we select the number k of similar ads that we will compare against. Then we calculate the similarity in order to rank each listing as a function of similarity. Finally, we calculate the average price of the k nearest listings and consider it the current price. It should be noted here that we set k to 5.

To calculate the distance, we began with the accommodates variable and then incorporated other variables. The smallest possible Euclidean distance is equal to zero, which would mean that the current observation is similar to ours. The
calculation of the Euclidean distance is described in detail in Chapter 11 of this book. The calculation of the Euclidean distance for each observation in the data-set generated the results presented in Table 10.3.

Distance    Nbr
0           5938
1           44093
2           5279
3           3156
4           486
5           560
6           101
7           135
8           22
9           37
10          13
11          16
12          13
13          31
14          1

Table 10.3. Number of observations by distance
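Counts like those in Table 10.3 can be reproduced with a small sketch (the values below are hypothetical; note that in one dimension, with accommodates alone, the Euclidean distance reduces to an absolute difference):

```python
import pandas as pd

# Hypothetical accommodates column; our own apartment hosts 3 people
accommodates = pd.Series([1, 2, 3, 3, 4, 6, 2, 3, 5, 8])
our_value = 3

# One-dimensional Euclidean distance = absolute difference
distances = (accommodates - our_value).abs()

# Number of observations at each distance, as in Table 10.3
print(distances.value_counts().sort_index())
```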
The results show that there are 5,938 observations in the data-set that have a distance equal to 0. This means that these 5,938 observations accommodate the same number of people as our apartment. If we used only the first five observations with a distance equal to zero, we might end up with biased predictions. To avoid this problem, we randomized the order of the observations and then selected the first five rows. This gave us the randomly selected observations shown in Table 10.4.

Observation    Price (EUR)
19843          75
10454          99
17543          129
13622          100
36921          120

Table 10.4. The price of the first five randomized observations
It should be noted here that the retrieved price values contained certain characters, such as the dollar sign ($) and commas. We therefore cleaned the data by removing these characters and converting the prices to a floating-point type.
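Both the cleaning step and the averaging that follows can be sketched like this, using the five prices from Table 10.4 (the raw dollar-formatted strings are assumed):

```python
import pandas as pd

# Prices as retrieved, with dollar signs (Table 10.4 values, hypothetical format)
raw_prices = pd.Series(["$75.00", "$99.00", "$129.00", "$100.00", "$120.00"])

# Strip the unwanted characters and convert to a floating-point type
prices = (raw_prices.str.replace("$", "", regex=False)
                    .str.replace(",", "", regex=False)
                    .astype(float))

# The kNN prediction is the average price of the k = 5 nearest listings
predicted_price = prices.mean()
print(predicted_price)  # 104.6
```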
Then we calculated the average price of these first five observations, obtaining a value of 104.6 euros. This means that when we use only the accommodates variable to predict prices (with the same number of guests), we should set the price of the apartment to 104.6 euros.

The question that arises now is: how correct is this prediction? To answer it, we evaluated the accuracy of the model. For this, we divided our database into two parts:
– a training set, which contains half of the observations;
– a test set, containing the other half, for testing the model.

The idea is to use half of the observations to predict the price and to compare the predicted value with the actual price in the data-set in order to measure the accuracy of the result. We calculated the RMSE of this model and obtained a value of 272.59.

To better understand the behavior of this model, we then used other variables (bathrooms, bedrooms, longitude, etc.) to create other price predictions. We computed the RMSE for each of these variables and compared them. The results are shown in Table 10.5.

Predictor                    RMSE
accommodates                 272.59
bedrooms                     275.36
bathrooms                    276.79
host_total_listings_count    287.49

Table 10.5. Comparison of RMSE values for different predictors
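The evaluation procedure can be approximated with scikit-learn's KNeighborsRegressor on synthetic stand-in data (the chapter computes the distances manually; this sketch uses the library equivalent, and all values are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Hypothetical listings: price driven by accommodates plus noise
rng = np.random.RandomState(0)
df = pd.DataFrame({"accommodates": rng.randint(1, 9, 400)})
df["price"] = 30 * df["accommodates"] + rng.normal(0, 20, 400)

# Half of the observations train the model, the other half test it
X_train, X_test, y_train, y_test = train_test_split(
    df[["accommodates"]], df["price"], test_size=0.5, random_state=0
)

# k = 5 neighbours, prediction = average price of the neighbours
knn = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)
rmse = np.sqrt(mean_squared_error(y_test, knn.predict(X_test)))
print(round(rmse, 2))
```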
You can see from the results in this table that the kNN algorithm captures the features that most affect the price. According to the RMSE values, the best predictor is the one with the smallest RMSE: accommodates. So, we have built a model that explains the prices of Airbnb apartments using a single variable (accommodates): this is called a “univariate model”. The prediction could then be refined by incorporating several variables at once, giving a multivariate model.
Our goal was not only to choose a model that best represents all of our data, but also to show you the power of analytics and what we can do with the various algorithms. For further results, you can try the other techniques we have discussed in this chapter (Naïve Bayes, SVM, and neural networks). You then need to select the most appropriate model for making predictions, based on the characteristics of the data-set, the predictors or variables, and, of course, the prediction goal you want to achieve. Indeed, it is important to see how companies can understand their data: how to handle missing variables, create an explanatory model, build a decision tree, etc. This enables them to understand and form their own analytics instruction manual for working with large data-sets, or Big Data.

10.4. Conclusion

In this chapter, you learned about several Machine Learning algorithms and their different uses for solving classification problems and even regression problems. We found that these algorithms are used in a wide variety of applications. This chapter examined the theory behind the various classification algorithms and used the example of a sharing economy company to explain how these methods work in the business world. Using this example, which is the same one we used in the previous chapter, we conducted tests with two tree-based algorithms: the decision tree and Random Forest. The results allowed us to compare the different models and choose the best one; based on the RMSE value, the most reliable model was Random Forest. We then applied the k-nearest neighbors algorithm to create a univariate predictive model, and the results show that other variables can be used to improve the prediction. This small data analysis project is a crucial exercise for anyone who wants to work with Big Data.
This is a very easy data-set, but even if you think that predicting prices for Airbnb apartments is a problem that does not interest you, examples like this allow you to apply different methods and algorithms for supervised learning, pending our discussion of unsupervised algorithms (cluster analysis) which will be the topic of the next chapter. The quantity and variety of data produced today is a big advantage for sharing economy companies. Developing a data analysis process is therefore essential to creating value. This example is only one of many use cases, because data analysis
methods can be applied in different circumstances that require model identification or prediction. These tests and analyses are only part of what supervised learning algorithms (regression and classification) allow sharing economy companies to do. If software makes it possible to project, or to model the past in order to develop scenarios for the future, then Big Data represents a step forward by modeling the future to make good decisions possible today.

TO REMEMBER. This chapter provided the opportunity to learn that:
– classification is an approach for classifying new observations based on historical observations;
– classification algorithms are varied, can be applied to large data-sets, and make it possible to classify data into groups and to identify the class or category to which a new data point belongs. In addition to decision trees, Naïve Bayes, and SVM, other algorithms are commonly used for both classification and regression, such as kNN, Random Forest and neural networks.
11 Cluster Analysis
The Milky Way is nothing more than a mass of innumerable stars planted together in clusters. Galileo Galilei
11.1. Introduction

In the previous two chapters, we examined the use of supervised learning techniques, regression and classification, to create models using large sets of previously known observations. This means that the class labels were already available in the data-sets. In this chapter, we’ll explore cluster analysis, or “clustering”. This type of analysis includes a set of unsupervised learning techniques for discovering hidden and unlabeled structures in data. Clustering aims to seek natural groupings in the data, so that the elements of the same group, or cluster, are more similar to one another than to the elements of other groups. Given its exploratory nature, cluster analysis is a fascinating subject, and in this chapter, you will learn the key concepts that can help you organize data into meaningful structures.

In this chapter, we’ll address different techniques and algorithms for cluster analysis, namely:
– learning to search for points of similarity using the k-means algorithm;
– using a bottom-up approach to build hierarchical classification trees.
Sharing Economy and Big Data Analytics, First Edition. Soraya Sedkaoui and Mounia Khelfaoui. © ISTE Ltd 2020. Published by ISTE Ltd and John Wiley & Sons, Inc.
This chapter will therefore examine these two algorithms. As you’ll see, the best way to appreciate the importance of cluster analysis for sharing economy companies is to apply these algorithms to a use case. We invite you to learn more, in what follows, about this type of unsupervised learning.

11.2. Cluster analysis: general framework

Machine Learning is undoubtedly one of the major assets for addressing both current and future challenges in society. Among the various components of this discipline, we will focus in this chapter on one of its characteristic applications: cluster analysis, or clustering. Cluster analysis covers diverse and varied topics and allows problems to be studied from different perspectives (Sedkaoui and Khelfaoui 2019). This analysis aims to segment the study population without defining a priori the number of classes or clusters, interpreting the clusters thus created a posteriori. In this kind of analysis, the human does not need to assist the machine in its discovery of the various typologies: no target variable is provided to the algorithm during the learning phase.

Clustering consists of grouping a set of observations so that the objects of the same group, called a “cluster”, are more alike in some way than the objects of other groups or clusters (Sedkaoui 2018b). In general, this method uses unsupervised techniques to group similar objects. In Machine Learning, unsupervised learning refers to the problem of finding a hidden structure in unlabeled data. In clustering, no predictions are made: the different techniques identify the similarities between observations based on the observable attributes and group similar observations into clusters. Among the most popular clustering methods are the k-means algorithm and hierarchical classification, which we’ll cover later in this chapter.
Cluster analysis is considered one of the main tasks of data exploration and a common technique for statistical analysis, and is used in many areas including Machine Learning, pattern recognition, image analysis, information retrieval, bioinformatics, etc.
Finding patterns in data using clustering algorithms seems interesting, but what exactly makes this analysis possible? 11.2.1. Cluster analysis applications At first glance, one might think that this method is rarely used in real applications. But this is not the case, and in fact there are many applications for clustering algorithms. We’ve seen them before, when looking at how Amazon (Amazon Web Services) recommends the right products, or how YouTube offers videos based on our expectations, or even how Netflix recommends good films, all by applying the clustering algorithm (Sedkaoui 2018a). For sharing economy companies, cluster analysis can be used to explore datasets. For example, a clustering algorithm can divide a set of data containing comments about a particular apartment into sets of positive and negative comments. Of course, this process will not label the clusters as “positive” or “negative”, but it will produce information that groups comments based on a specific characteristic of these comments. A common application for clustering in the sharing economy universe is to discover customer segments in a market for a product or service. By understanding which attributes are common to particular groups of customers, marketers can decide which aspects of their campaigns to highlight. The effectiveness of applying a clustering algorithm can thus enable a significant increase in sales for an ecommerce site or a digital platform. Sharing economy companies can adopt unsupervised learning to develop services on their platforms. For example, given a group of clients, a cluster analysis algorithm can bundle services and products that may interest its customers based on similarities in one or more other groups. Another example may involve the discovery of different groups of customers in an online store’s customer base. Even a chain restaurant can group clients based on menus selected by geographical location, and then change its menus accordingly. 
The examples of applications are many and varied in the sharing economy context; we encounter this unsupervised learning algorithm in customer
segmentation, a subject of considerable importance in the marketing community, but also in fraud detection, etc. But how does it work? Good question: we need to look more closely at the details.

11.2.2. The clustering algorithm and the similarity measure

The question now is: how does cluster analysis measure similarity between objects? In general, clustering algorithms examine a set number of data characteristics and map each data entity onto a corresponding point in an n-dimensional diagram. These algorithms then combine the similar elements based on their relative proximity in the graph. The goal is to divide a set of objects, represented by the input data:

{x_1, x_2, ..., x_n}

into a set of k disjoint clusters:

{{x_{1,1}, x_{1,2}, ..., x_{1,m_1}}, {x_{2,1}, x_{2,2}, ..., x_{2,m_2}}, ..., {x_{k,1}, x_{k,2}, ..., x_{k,m_k}}}

so that the objects within each cluster are similar to each other. Clustering consists of classifying data into homogeneous groups, called “clusters”, so that the elements of the same class are similar and the elements belonging to two different classes are different. It is therefore necessary to define a similarity measure for any two data points: the distance.

Each element can be defined by the values of its attributes, or what we call from a mathematical point of view “a vector”:

x = (a_1, a_2, ..., a_d)

The number of elements in this vector is the same for all elements and is called the “vector dimension” (a d-dimensional vector).
Given two vectors v1 and v2, to measure similarity we need to calculate the distance between them. Generally, this similarity is defined by the Euclidean distance or by the Manhattan distance (Figure 11.1).
Figure 11.1. Distance calculation: Euclidean distance (left) and Manhattan distance (right), each plotted as Attribute 2 against Attribute 1
The Euclidean distance presents the distance between two points x and y identified by their coordinates (x_1, ..., x_n) and (y_1, ..., y_n). We have:

d(x, y) = √( Σ_i (x_i − y_i)² )

The Manhattan distance, whose name is inspired by the famous New York borough that contains many skyscrapers, calculates the distance as follows:

d(x, y) = Σ_i |x_i − y_i|
To calculate the similarity, it is best to choose the Euclidean distance as the metric because it is simpler and more intuitive. Figure 11.2 shows how to group objects based on similarity. The blue segment refers to the distance between the two points. Once we have calculated the distances between each point, the clustering algorithm automatically classifies the nearest or most “similar” points into the same group.
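For two concrete (hypothetical) vectors, both distances are a few lines of NumPy:

```python
import numpy as np

# Two points described by their attribute vectors (hypothetical values)
v1 = np.array([2.0, 3.0])
v2 = np.array([5.0, 7.0])

euclidean = np.sqrt(np.sum((v1 - v2) ** 2))  # straight-line distance
manhattan = np.sum(np.abs(v1 - v2))          # city-block distance

print(euclidean, manhattan)  # 5.0 7.0
```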
Figure 11.2. The principle of clustering. For a color version of this figure, see www.iste.co.uk/sedkaoui/economy.zip
Clustering algorithms are heavily dependent on the definition of the concept of similarity, which is often specific to the application domain (Sedkaoui 2018b). The principle consists of assigning classes so as to:
– minimize the distance between the elements in the same cluster (intra-class distance);
– maximize the distance between each cluster (inter-class distance).

The goal is to identify groups of similar inputs and to find a representative value for each cluster. You can do this by examining the input data-sets, each of which is a vector in a d-dimensional space. In this context, two algorithms are frequently used: k-means and hierarchical classification. In the next section, we will discuss in detail the k-means algorithm, which is widely used in practice.

11.3. Grouping similar objects using k-means

Clustering is a technique for searching for groups of similar objects, that is, objects that are more closely related to each other than to objects belonging to other groups. Examples of clustering applications, which we have previously cited from sharing economy companies, can include the grouping of documents, music, and films on various subjects, or of customers with similar interests (bikes, etc.) based on
purchasing behavior, or of requests for a common service as the basis for a recommendation engine.

One of the simplest and most prevalent cluster analysis algorithms is undoubtedly the k-means method. This algorithm can be defined as an approach for organizing the elements of a collection of data into groups based on similar characteristics. Its objective is to partition the set of entries in a way that minimizes the sum of squared distances from each point to the average of its assigned cluster.

For example, say you’re looking to start a service on a digital platform and you want to send a different message to each target audience. First, you must assemble the target population into groups. Individuals in each group will have a degree of similarity based on age, gender, salary, etc. This is what the k-means algorithm can do! This algorithm will divide all of your entities into k groups, with k being the number of groups created. The algorithm refines the different clusters of entities by iteratively calculating the average midpoint, or “centroid”, of each cluster (Sedkaoui 2018b). The centroids become the focus of the iterations, which refine their locations in the graph and reassign data entities to adapt to the new locations. This procedure is repeated until the clusters are optimized. This is how this algorithm can help you deal with your problem. We will now describe this algorithm to show you how it works and how we can determine the average.

11.3.1. The k-means algorithm

As you will see in a moment, the k-means algorithm is extremely easy to implement, and it is also very efficient in terms of computation compared to other cluster analysis algorithms, which might explain its popularity. The k-means algorithm belongs to the category of prototype-based clustering. We will discuss another type of clustering, hierarchical clustering, later in this chapter.
Prototype-based clustering means that each cluster is represented by a prototype, which may be the centroid (average) of similar points with continuous characteristics or the medoid (most representative or most frequent) in the case of categorical characteristics. Box 11.1. Cluster analysis based on a prototype
In real clustering applications, we have no information about the data categories. The objective is thus to group similar data points, which can be done using the k-means algorithm as follows:
– randomly choose k centroids from the group of data points as initial cluster centers;
– assign each data point to the nearest centroid, that is, to the class with the minimum distance;
– recompute the centroids of the resulting clusters;
– repeat the previous two steps until the cluster assignments no longer change.

k-means is a data analysis technique that identifies clusters of objects for a given value of k based on the proximity of the objects to the center of the groups (MacQueen 1967). The center is the arithmetic average of the attribute vector, with n dimensions, for each group. We will now describe this algorithm to show you how it works and how it determines the average. To illustrate the method, we’ll consider a collection of objects with n attributes, and we’ll examine the two-dimensional case (n = 2). We chose this case because it is much easier for visualizing the k-means method.

11.3.1.1. Choosing the value of k

The k-means algorithm can find the clusters. This algorithm will select k points randomly to initially represent each class. In this example, k = 3, and the initial centroids are indicated by red, green, and blue points in Figure 11.3.
Figure 11.3. Step 1: define the initial points for the centroids. For a color version of this figure, see www.iste.co.uk/sedkaoui/economy.zip
11.3.1.2. Assign each data point to a centroid

In this step, the algorithm calculates the distance between each data point (x, y) and each centroid. It then assigns each point to the nearest centroid. This association defines the first k clusters (here, k = 3).

11.3.1.3. Assign each point to a class

Here, the algorithm calculates the center of gravity for each cluster defined in the previous step. This center of gravity is the ordered pair of arithmetic averages of the coordinates of the points in the cluster. In this step, a centroid is calculated for each of the k clusters.

11.3.1.4. Update the representatives for each class

We can now recalculate the centroid for each group (the center of gravity), repeating the process until the algorithm converges. Convergence is reached when the calculated centroids no longer change, or when the centroids and the attributed points oscillate from one iteration to the next. Thus, in Figure 11.4, we can identify three clusters:
– cluster 1: in red;
– cluster 2: in blue;
– cluster 3: in green.
Figure 11.4. Step 2: assign each point to the nearest centroid. For a color version of this figure, see www.iste.co.uk/sedkaoui/economy.zip
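The whole loop (initialize k centroids, assign points, update centroids, repeat until convergence) is what scikit-learn's KMeans implements; a sketch on three synthetic two-dimensional blobs (all coordinates hypothetical):

```python
import numpy as np
from sklearn.cluster import KMeans

# Three well-separated synthetic blobs of 2-D points
rng = np.random.RandomState(0)
points = np.vstack([
    rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
    rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
])

# k = 3: pick initial centroids, assign each point to the nearest
# centroid, recompute the centroids, and repeat until convergence
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)

print(kmeans.cluster_centers_)      # final centroid coordinates
print(np.bincount(kmeans.labels_))  # cluster sizes
```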
11.3.2. Determining the number of clusters

The k-means algorithm solves the following problem: “Starting with N points and an integer k, the problem consists of dividing the points into k partitions in order to
minimize some function.” k-means is the most frequently used clustering algorithm due to its speed compared to other algorithms. In practice, k does not correspond to the number of clusters that the algorithm will find and where the items are stored, but rather to the number of centers (that is to say, the center points of the clusters) (Sedkaoui 2018a). Indeed, a cluster may be represented by a circle with a center point and a radius.

We seek to group N points. The algorithm begins with k central points that define the centers of each cluster, to which the N points will be attributed at the end. In the first step, the N points are associated with these different centers (initially specified or chosen at random). The next step is to recalculate the centers as the average of all points in the cluster. As we’ve seen from the various steps already described, we must first find the points associated with each cluster by calculating the distance between the cluster center and these points. Then, the center of the cluster is recalculated, until the centers stabilize, indicating that they have reached the optimal values for representing the N starting points.

So, with this algorithm, clusters can be identified in a data-set, but what value should we select for k? The value of k can be chosen based on a reasonable estimate or a predefined requirement. At this stage, it is very important to know the value of k, that is, how many clusters are to be defined. So, a measure must be tested to determine a reasonably optimal value for k.

The k-means algorithm is easily applied to numerical data, to which the concept of distance can of course be applied. It should be noted as well that the distance most frequently used for grouping objects is the square of the Euclidean distance, described previously, between two points x and y in an n-dimensional space.
Based on this metric of Euclidean distance, we can describe the k-means algorithm as a simple optimization problem: an iterative approach to minimizing the sum of squared errors within clusters (SSE), sometimes also called the "cluster inertia":

SSE = Σ_{i=1}^{N} Σ_{j=1}^{k} w^{(i,j)} ‖x^{(i)} − μ^{(j)}‖²

where μ^{(j)} is the centroid of cluster j, and w^{(i,j)} equals 1 if the point x^{(i)} is assigned to cluster j and 0 otherwise.
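As a numerical check on this definition, the SSE can be computed directly from the final assignments and compared with the inertia_ attribute that scikit-learn's KMeans reports; the four-point data-set below is an assumption for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.2, 5.1]])
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# SSE: squared Euclidean distance from each point to the centroid
# of the cluster it was assigned to, summed over all points
sse = sum(np.sum((x - km.cluster_centers_[j]) ** 2)
          for x, j in zip(X, km.labels_))

print(sse, km.inertia_)  # the two values coincide
```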
We can thus deduce that the SSE is the sum of the squares of the distances between each point and its nearest centroid. If the points are relatively close to their respective centroids, this sum is relatively small. However, despite its usefulness, k-means has some limitations, related to the following points.

11.3.2.1. Categorical data

As indicated at the end of the previous section, k-means does not support categorical data. In such cases, we use k-modes (Huang 1997), which we have not discussed in this chapter, but which is a method commonly used to group categorical data based on the number of differences between the respective components of the attributes.

11.3.2.2. The definition of the number of clusters

It may well be that a data-set has characteristics that make partitioning into three clusters better than into two. We may precede our k-means analysis with a hierarchical classification, which will automatically set the best number of classes to select. In hierarchical classification, each object is initially placed in its own cluster; each object is then combined with the most similar cluster. This process is repeated until all objects are combined into a single cluster. This method will be discussed in the next section.

11.4. Hierarchical classification

The amount of data available today generates new problems for which traditional methods of analysis do not have adequate answers. Thus, the classical framework of clustering, which consists of assigning one or more classes to an instance, extends to problems with thousands or even millions of different classes. With these problems, new avenues for research appear, such as how to reduce the complexity of classification, which is generally linear in the number of classes and therefore requires a solution when the number of classes becomes too large (Sedkaoui 2018b). Among these solutions, we can cite the hierarchical model.
In this section, we will examine hierarchical classification as an alternative clustering approach.
11.4.1. The hierarchical model approach

Hierarchical classification creates a hierarchy of clusters. The algorithm outputs a hierarchy, a structure that provides more information than a set of unstructured clusters, and it does not require prior specification of the number of classes. These benefits are counteracted by lower efficiency.

Hierarchical classification establishes hierarchies of clusters according to two main approaches, agglomerative clustering and divisive clustering:
– agglomerative: the bottom-up strategy, wherein each observation begins in its own group, and pairs of groups are merged as one moves up the hierarchy;
– divisive: the top-down strategy, in which all observations begin in the same group, and groups are then split recursively down the hierarchy.

In other words, in divisive clustering, we start with a single cluster that includes all of our samples, and we iteratively divide it until each cluster contains only one sample. Agglomerative clustering takes the opposite approach: we start with each sample as an individual cluster and merge the closest pairs of clusters until only one cluster remains.

In order to decide which clusters to merge, a measure of similarity between clusters is introduced. More specifically, this comprises a distance measurement and a linkage. The distance measurement is what we discussed earlier; a linkage is essentially a function of the distances between points. The two standard linkages for hierarchical classification are single linkage and complete linkage. Single linkage calculates the distance between the most similar members of each pair of clusters and merges the two clusters whose most similar members are closest.
Complete linkage is similar to single linkage, but instead of comparing the most similar members in each pair of clusters, this method compares the most dissimilar members to perform the merger. One advantage of this algorithm is that it allows us to draw dendrograms (visualizations of a binary hierarchical classification), which can help in the interpretation of results by creating meaningful taxonomies. Another advantage of this hierarchical approach is that it is not necessary to specify the number of clusters in advance.
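The effect of the linkage choice can be seen with scikit-learn's AgglomerativeClustering, whose linkage parameter accepts "single" and "complete" among others; the six-point data-set is an illustrative assumption.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two compact, well-separated groups of three points each
X = np.array([[0.0, 0.0], [0.0, 0.5], [0.5, 0.0],
              [4.0, 4.0], [4.0, 4.5], [4.5, 4.0]])

single = AgglomerativeClustering(n_clusters=2, linkage="single").fit(X)
complete = AgglomerativeClustering(n_clusters=2, linkage="complete").fit(X)

# On clearly separated groups, both linkages find the same partition;
# they diverge on elongated or chained clusters
print(single.labels_, complete.labels_)
```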
11.4.2. Dendrograms

Hierarchical classification provides a tree-like graphical representation called a "dendrogram". This represents all of the classes created during execution of the algorithm. These classes are grouped, because the iterations are based on the comparison of distances or differences between individuals and/or classes (Sedkaoui 2018b). The size of the dendrogram branches is proportional to the measurement between the grouped objects.

Graphically, we must divide where the branches of the dendrogram are highest; in other words, this is where to stop distributing into classes. The graph of the evolution of intra-class inertia as a function of the number of clusters also aids the graphical selection of the number of clusters in the analysis process. This inertia decreases as the number of classes increases.
Figure 11.5. K-means versus hierarchical classification. For a color version of this figure, see www.iste.co.uk/sedkaoui/economy.zip
Indeed, the classes become more homogeneous and decrease in size, and the elements come closer and closer to their cluster centroid. This appears as an abrupt reduction of intra-class inertia as a function of the number of clusters to be identified.

Different aggregation strategies between two classes A and B can be considered during the construction of the dendrogram. The best known are single linkage, complete linkage, and the Ward criterion.

The Ward criterion is based on a bottom-up (agglomerative) approach. Each sample begins in its own cluster, and pairs of clusters are merged gradually as one moves up the hierarchy. In this context, the selection criterion for the pair of clusters to be merged at each step is the minimum variance criterion. The Ward minimum variance criterion minimizes the increase in total variance within clusters at each merger:

d(A, B) = (w_A w_B) / (w_A + w_B) · d²(μ_A, μ_B)

Box 11.2. Ward criterion
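With μ_A and μ_B the class centroids and w_A, w_B the class weights (here, the class sizes), the Ward minimum variance criterion of Box 11.2 can be written as a small helper function; the numeric example is an assumption for illustration.

```python
import numpy as np

def ward_cost(A, B):
    """Increase in intra-class inertia caused by merging classes A and B:
    (w_A * w_B) / (w_A + w_B) * d^2(mu_A, mu_B), with d the Euclidean distance."""
    A, B = np.asarray(A, dtype=float), np.asarray(B, dtype=float)
    w_a, w_b = len(A), len(B)
    mu_a, mu_b = A.mean(axis=0), B.mean(axis=0)
    return (w_a * w_b) / (w_a + w_b) * np.sum((mu_a - mu_b) ** 2)

# Two singleton classes two units apart: (1*1)/(1+1) * 2^2 = 2
print(ward_cost([[0.0, 0.0]], [[2.0, 0.0]]))  # 2.0
```

At each step, an agglomerative algorithm using Ward's criterion merges the pair of classes for which this cost is smallest.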
This criterion considers d as a Euclidean distance, μ_A and μ_B as the centroids of the relevant classes, and w_A and w_B as the respective weights of classes A and B.

The main question that arises in practice is: which criterion should we choose, and why? The answer to this question is summarized in Table 11.1.

Criterion         Use
Single linkage    Associated with a minimal tree
Complete linkage  Creates compact classes that are arbitrarily close
Ward criterion    Minimizes the rise of intra-class inertia

Table 11.1. Comparison of different criteria
However, clusters generally have similar variances, which is why the most popular choice among users is undoubtedly the Ward criterion. This criterion tends to create rather spherical classes of equal size at the same level of the dendrogram. It is also relatively easy to use, even when the distance between elements is non-Euclidean.

Before turning to clustering applications, it should be noted that hierarchical classification is time-consuming, which matters chiefly for data-sets with a large number of elements. The dendrogram also becomes less readable as this number increases.

Now that you know how these two clustering algorithms work, let's see how to implement them with Python.

11.5. Discovering hidden structures with clustering algorithms

Clustering maximizes the variance between groups and minimizes the differences within clusters in order to bring similar observations closer together. The database that we will analyze contains 59,740 observations. In practice, we will perform the classification for different values of k. For each value, we will calculate a measure of fit; we will then choose the value of k that offers the best compromise between a fairly good fit to the data and a reasonable number of parameters to estimate.
The objective is to develop a model that provides a better understanding of all observations. During the analysis process, the clusters will be identified and described. In what follows, we will use the k-means algorithm to separate the data into groups and thus to create clusters with similar values.

11.5.1. Illustration of the classification of prices based on different characteristics using the k-means algorithm

As we saw at the beginning of this chapter, the k-means algorithm is used to group observations near their average, which enables more efficient data categorization. In this case, the k-means algorithm is used to group the characteristics that influence the price of Airbnb apartments in the city of Paris. To run this algorithm, we will use the data-set we used in the previous two chapters.

To analyze the data with Python, we will import the pandas, NumPy, pyplot and sklearn libraries. We also need the KMeans class from the Scikit-Learn library, which we will load from sklearn's cluster sub-module. Once the data is put into the right format, that is to say into a DataFrame, training the k-means algorithm is straightforward with Scikit-Learn.

Our goal is to model the price for the different characteristics using the k-means algorithm. To build our model, we need a way to effectively project the price based on a set of predetermined parameters; we have therefore chosen to apply this algorithm to identical neighborhoods and listings with similar attributes. This ensures that the clusters are formed from properties that are as similar as possible, so that differences in attributes are the only factor affecting prices and so that we can determine their influence on the price. The aim is to exploit data on Airbnb prices in Paris; the results of the process will generate useful knowledge for price comparisons.
First, we will normalize the data with min-max scaling (sklearn's MinMaxScaler) to allow the k-means algorithm to interpret the data correctly. Figure 11.6 shows the scatter plot of our data.
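The min-max step rescales each column to the [0, 1] interval; the small array below is a stand-in assumption for the Airbnb features, which are not reproduced here.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Stand-in for two numeric listing features on very different scales
X = np.array([[30.0, 1.0],
              [120.0, 3.0],
              [250.0, 6.0]])

# Each column is mapped to [0, 1]: (x - min) / (max - min)
X_scaled = MinMaxScaler().fit_transform(X)
print(X_scaled.min(axis=0), X_scaled.max(axis=0))
```

Without this step, the feature with the largest range would dominate the Euclidean distances that k-means relies on.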
Figure 11.6. Distribution of observations
Figure 11.6 shows the membership of each observation in the different clusters. However, the results are not very telling; visually, we can say that three or four groups have formed.

This is the goal of using the k-means algorithm: separating these observations into different groups. Before actually running the algorithm, we must first generate an "elbow" curve to determine the number of clusters we need for our analysis. The idea behind the "elbow" method is to find the number of clusters for which the imbalance between the clusters increases most rapidly.

11.5.2. Identify the number of clusters k

This method evaluates the percentage of the variance explained by the number of clusters. Its aim is to separate the observations into different groups; we do this in order to rank the apartments with different pricing models into distinct groups. It is therefore appropriate to choose a number of clusters such that one additional cluster does not provide a better model.

Specifically, if one plots the percentage of variance explained by the clusters against the number of clusters, the first clusters will explain most of the variance. But at some point, the marginal gain of information will drop, creating an angle in the graph. The number of clusters is selected at this point, hence the "elbow criterion".
The difficulty of any clustering method lies in the selection of the number of clusters, or k. In most cases, this number is unknown. With regard to the k-means algorithm and the hierarchical classification algorithm applied by calculating the Ward distance, it is possible to draw the intra-class inertia curve as a function of k. We then seek to identify the steps where a break is observed in this curve, indicating a sharp deterioration in intra-class inertia. This degradation results from the high heterogeneity of the two classes united during the relevant stage. In this case, we will consider a number of classes greater than that in which the break occurs. This strategy is called “the elbow criterion”. This criterion provides satisfactory results when applied. Box 11.3. The elbow criterion
This elbow cannot always be unambiguously identified. The percentage of variance explained is the ratio of the intergroup variance to the total variance, also called the “F test”. A slight variation on this method draws the intra-group variance curve. To set the number of clusters in the data-set, we will use the “elbow” curve. The elbow method takes into account the percentage of variance explained by the number of clusters: one must choose this number such that adding another cluster does not provide a better data model. Generally, the number of clusters equals the value of the x-axis at the point that is the corner of the elbow, which we described in Box 11.3 (the line often resembles an elbow). Figure 11.7 shows that the score, or the percentage of variance explained by the clusters, stabilizes at three clusters. That's what we’ll include in the k-means analysis.
Figure 11.7. Determination of the number of clusters using the elbow method. For a color version of this figure, see www.iste.co.uk/sedkaoui/economy.zip
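The elbow curve itself comes from fitting KMeans for a range of k values and recording the inertia (the SSE) of each fit. Since the Airbnb data is not reproduced here, a synthetic three-group data-set stands in for it.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in: three well-separated groups of observations
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

inertias = []
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

# Inertia decreases as k grows; the "elbow" is where the drop flattens
print([round(v, 1) for v in inertias])
```

Plotting inertias against k (e.g. with pyplot) reproduces the kind of curve shown in Figure 11.7.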
This implies that adding further clusters would not explain much additional variance in our variable. Therefore, since we were able to identify the optimal number of clusters (k = 3) using the "elbow" curve, we will set n_clusters to three.

Once the appropriate number of clusters has been identified, principal component analysis (PCA) can be undertaken to convert data that is too scattered into a set of linear combinations that is easier to interpret. Now we can run the k-means algorithm. We simply have to instantiate an object of the KMeans class and tell it how many clusters we want to create. Thereafter, we call the fit() method to compute the clusters. Figure 11.8 shows the membership of each observation by class:
– cluster 1: observations with Class 0, in blue;
– cluster 2: observations with Class 1, in red;
– cluster 3: observations with Class 2, in green.
Figure 11.8. The three clusters. For a color version of this figure, see www.iste.co.uk/sedkaoui/economy.zip
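The steps above (scaling, an optional PCA projection, then KMeans with n_clusters set to three and a call to fit()) can be sketched end to end; synthetic data again stands in for the Airbnb listings.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

# Stand-in data: 300 observations, 5 features, 3 underlying groups
X, _ = make_blobs(n_samples=300, n_features=5, centers=3,
                  cluster_std=0.5, random_state=0)

# Normalize, project onto the first two principal components, then cluster
X_scaled = MinMaxScaler().fit_transform(X)
X_2d = PCA(n_components=2).fit_transform(X_scaled)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_2d)

print(np.bincount(km.labels_))  # size of each of the three clusters
```

Plotting X_2d colored by km.labels_ produces a picture analogous to Figure 11.8.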
The application of the k-means algorithm has allowed us to divide the data into groups based on their characteristics, without needing to know their corresponding labels (the y variable). Using this data led us to better understand the context of our main question. The process of analysis that we carried out in this part's three chapters has allowed us to interpret the results and to easily identify the most important predictors for setting price.

For each kind of analysis question, a specific group of methods called algorithms can be applied. These algorithms present a real opportunity for change: they make our programs smarter by enabling them to learn automatically from the data we provide.

The purpose of this example, and of all of the examples we have shown in this part, is to demonstrate how you can read and understand your data, apply tools and methods, and visualize your results. The goal is to help you learn to apply the different techniques that we can use to analyze large amounts of data. With the various examples illustrated in this last part, you can see and understand how data analysis can generate additional knowledge and how it transforms ideas into business opportunities. Big Data analytics has revolutionized business by drawing on the immensity of big data to produce observations and unique deductions, never before imagined, to better predict the next step to be taken.

11.6. Conclusion

This chapter provided a detailed explanation of cluster analysis. It introduced two different cluster analysis algorithms that can be applied to find hidden structures in data-sets. To demonstrate them in practice, we used a sample database. In this chapter, you learned how to use the k-means algorithm to group similar observations using the Python programming language, which is easy to use. Now, it's your turn to take up other examples, test a number of different clusters, and see what groups emerge.

TO REMEMBER.– Overall, what you need to consider when using a clustering approach is summarized in the following points:
– cluster analysis is an unsupervised learning method;
– this method groups similar objects based on their attributes;
– the most frequently used clustering algorithm is the k-means algorithm. To ensure proper implementation, it is important to:
– properly adjust the attribute values;
– ensure that the distance between the assigned values is significant;
– set the number of groups k so that the SSE is reasonably minimized;
– if k-means does not appear to be an appropriate classification technique for a given set of data, alternative techniques such as hierarchical classification should be considered.
Conclusion
Via the three chapters in Part 3, we learned to work with data, define problems, solve them using different methods, and select from among several techniques. Throughout the book, we also addressed these questions: What is Big Data? Why has this phenomenon become so important in the business context? How can we extract value from data? How can sharing economy businesses seize the opportunities offered by data-sets?

Obviously, these companies can adopt a data-driven approach in several ways: by using Big Data technologies, exploring new methods that can find correlations among the available data, developing algorithms and tools capable of handling a wide variety of data, optimizing the analysis process, and so on. This gives an overview of how these companies can develop a new model in the sharing economy context through the use of data analysis tools that allow them to explore large amounts of data and to generate value.

So, we come to the end of this book and its main objective, which is to illustrate the importance of Big Data analytics for sharing economy companies. We hope that you have gained a basic understanding of these two areas. It is for you now to demonstrate your skills! The different algorithms for data analysis and the various Big Data technologies will probably make your work easier. But, as the Americans say: "People make the difference!"
Sharing Economy and Big Data Analytics, First Edition. Soraya Sedkaoui and Mounia Khelfaoui. © ISTE Ltd 2020. Published by ISTE Ltd and John Wiley & Sons, Inc.
References
Ackoff, R.L. (1989). From data to wisdom. Journal of Applied Systems Analysis, 15, 3–9.
Acquier, A., Daudigeos, T., Pinkse, J. (2017). Promises and paradoxes of the sharing economy: An organizing framework. Technological Forecasting & Social Change, 125, 1–10 [Online]. Available at: https://www.sciencedirect.com/science/article/pii/S0040162517309101?via%3Dihub.
ADEME (2012). Étude du projet éco-conception. Cahier des charges [Online]. Available at: http://www.diagademe.fr/diagademe/.
ADEME (2017). L’économie de la fonctionnalité: de quoi parle-t-on ? Report [Online]. Available at: https://www.ademe.fr/sites/default/files/assets/documents/economie_fonctionnalite_definition_201705_note.pdf.
Alter, N. (2010). Donner et prendre: la coopération en entreprise. La Découverte, Paris.
Androutsopoulos, I., Koutsias, J., Chandrinos, K.V., Paliouras, G., Spyropoulos, C.D. (2000). An Evaluation of Naive Bayesian Anti-Spam Filtering. In Proceedings of the Workshop on Machine Learning in the New Information Age, Potamias, G., Moustakis, V., Van Someren, M. (eds). 11th European Conference on Machine Learning, Barcelona.
Athané, F. (2008). Le don: histoire du concept, évolution des pratiques. PhD thesis, Université Paris 10 [Online]. Available at: http://www.theses.fr/2008PA100112.
Beer, D., Burrows, R. (2007). Sociology and, of and in Web 2.0: Some Initial Considerations [Online]. Available at: https://doi.org/10.5153%2Fsro.1560.
Benavent, C. (2017). Disruption à l’âge des plateformes. Économie et management, 165, 11–19.
Benkler, Y. (2004). Sharing Nicely: On Shareable Goods and the Emergence of Sharing as a Modality of Economic Production. The Yale Law Journal, 114, 273–358 [Online]. Available at: https://www.dropbox.com/s/ig8955sggxjd1h0/Sharing%20Nicely%20Benkler_FINAL_YLJ114-2.pdf.
Benoît-Moreau, F. (2006). La première rencontre mémorable entre marque et consommateur et son influence sur la relation: exploration par une approche qualitative phénoménologique. XXIIe Congrès de l’AFM, Nantes [Online]. Available at: https://www.academia.edu/14107108/La_premi%C3%A8re_rencontre_m%C3%A9morable_entre_un_consommateur_et_une_marque.
Benoît-Moreau, F., Delacroix, E., Parguel, B. (2017). Les bénéfices de l’économie collaborative pour les consommateurs financièrement contraints: le cas des sites d’achat/vente de seconde main. Congrès de l’association française de marketing, Tours [Online]. Available at: https://hal.archives-ouvertes.fr/hal-01819634.
Bensoussan, A., Fabre, A. (eds) (2018). The Digital Factory for Knowledge: Production and Validation of Scientific Results. ISTE Ltd, London, and Wiley, New York.
Bernet, V. (2014). L’approche lean startup. Comment maximiser les chances de trouver son marché lors du lancement d’une startup. Haute École de gestion et tourisme, Sierre [Online]. Available at: http://doc.rero.ch/record/235848/files/TB_Bernet_Valentin.pdf.
Bicrel, J. (2012). Fiche de lecture: What’s mine is yours, how collaborative consumption is changing the way we live, Rachel Botsman et Roo Rogers (2010). Observatoire du management alternatif [Online]. Available at: http://appli6.hec.fr/amo/Public/Files/Docs/241_fr.pdf.
Bigot, R., Hoibian, S., Muller, J. (2014). La connaissance du développement durable et de l’économie circulaire en 2014. Étude réalisée pour le compte de l’ADEME par le CREDOC, Sierre [Online]. Available at: http://www.credoc.fr/pdf/Sou/Connaissance_developpement_durable_et_economie_circulaire_2014.pdf.
Bishop, C. (2006). Pattern Recognition and Machine Learning. Springer, New York.
Borel, S., Demailly, D., Massé, D. (2015). Les fondements théoriques de l’économie collaborative. In Le Rapport moral sur l’argent dans le monde. AEF [Online]. Available at: https://www.researchgate.net/publication/287935542/download.
Botelho, K.R. (2008). La Société de consommation de Jean Baudrillard. Fiche de lecture [Online]. Available at: http://appli6.hec.fr/amo/Public/Files/Docs/68_fr.pdf.
Botsman, R. (2014). Sharing’s Not Just for Start-Ups. Harvard Business Review, 92(9), 23–25.
Botsman, R., Rogers, R. (2010). What’s Mine Is Yours: The Rise of Collaborative Consumption. Collins, London.
Bouillot, P.E. (2011). Le découplage entre la consommation des ressources naturelles et la croissance économique en question [Online]. Available at: https://programmelascaux.wordpress.com/2011/05/25/le-decouplage-entre-la-consommation-des-ressources-naturelles-et-la-croissance-economique-en-question/.
Bourg, D., Buclet, N. (2005). L’économie de fonctionnalité. Changer la consommation dans le sens du développement durable. Futuribles, 313(313), 27–38.
Bower, J.L., Christensen, C.-M. (1995). Disruptive Technologies: Catching the Wave. Harvard Business Review [Online]. Available at: https://hbr.org/1995/01/disruptive-technologies-catching-the-wave.
Breiman, L. (1996). Bagging Predictors. Machine Learning, 24(2), 123–140.
Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5–32.
Brousseau, E., Curien, N. (2001). Économie d’Internet, économie du numérique. Revue économique, 52, 7–36.
Brousseau, E., Penard, T. (2007). The Economics of Digital Business Models: A Framework for Analyzing the Economics of Platforms. Review of Network Economics, 6(2), 81–114 [Online]. Available at: https://www.researchgate.net/publication/24049766_The_Economics_of_Digital_Business_Models_A_Framework_for_Analyzing_the_Economics_of_Platforms.
Buclet, N. (2005). Concevoir une nouvelle relation à la consommation: l’économie de fonctionnalité. Annales des Mines – Responsabilité et environnement, 57–66 [Online]. Available at: https://hal.archives-ouvertes.fr/hal-00129110.
Buda, G., Lehota, J. (2017). Attitudes and Motivations of Consumers in Sharing Economy. Colloque Management, Enterprise and Benchmarking in the 21st Century, Budapest [Online]. Available at: https://kgk.uni-obuda.hu/sites/default/files/02_Buda_Lehota.pdf.
Busuttil, T. (2016). Pour une économie collaborative « responsable et vertueuse ». Le Monde [Online]. Available at: https://www.lemonde.fr/idees/article/2016/02/22/pour-une-economie-collaborative-responsable-et-vertueuse48698253232.html.
Carvalho, G., Dzimira, S. (2000). Don et économie solidaire. Édition Mauss [Online]. Available at: http://www.journaldumauss.net/IMG/pdf/_Don__cosolidaire-2.pdf.
Cases, P. (2016). Sharing economy’s “billion-dollar club” is going strong, but investor risk is high. Venture Beat [Online]. Available at: https://venturebeat.com/2016/02/07/sharingeconomys-billion-dollar-club-is-going-strong-but-investor-risk-is-high/.
CCI (2014). L’économie collaborative. Mythe, mode ou réalité ? Lettre thématique, 16 [Online]. Available at: http://www.campusfonderiedelimage.org/pushstartup/wp-content/uploads/2016/01/%C3%A9conomie-collaborative-mythe-mode-r%C3%A9alit%C3%A9.pdf.
Chaput, O. (2015). Seul on va plus vite, ensemble on va plus loin. Texte préparatoire, EcoRes [Online]. Available at: https://www.google.fr/url?sa=t&rct=j&q=&esrc=s&source=web&cd=5&cad=rja&uact=8&ved=2ahUKEwi4loaC74fiAhWInhQKHRKrA8cQFjAEegQIBhAC&url=https%3A%2F%2Ftheshift.be%2Fmember-attachment%2F00344c843b0cebfacba32cbe37f9173fbf3f7a8a&usg=AOvVaw31vAj-SJzKW_H9nDrYVzS-.
Chen, H., Chiang, R.H., Storey, V.C. (2012). Business intelligence and analytics: From big data to big impact. MIS Quarterly, 36(4).
Chen, M.S., Liu, Y. (2014). Big Data: A survey. Mobile Networks and Applications, 19(2), 171–209.
Christensen, C.M., Raynor, M., McDonald, R. (2015). What Is Disruptive Innovation? Harvard Business Review, 93(12), 44–53.
Coase, R. (2005). L’Entreprise, le marché et le droit. Éditions d’Organisation, Paris.
Condé, B. (2017). Économie collaborative: nouvelle rupture ou ultime ruse du capitalisme ? Thesis, Université de Liège [Online]. Available at: https://matheo.uliege.be/bitstream/2268.2/2551/4/Me%CC%81moire%20-%20Benjamin%20Cond%C3%A9.pdf.
Cooper, R., Timmer, V. (2015). Les administrations locales de l’économie du partage. Feuille de route. One Earth [Online]. Available at: http://www.localgovsharingecon.com/uploads/2/1/3/3/21333498/localgovsharingecon_resume_francais.pdf.
Courtois, G. (2016). L’économie collaborative: Révolution du partage ou ultime ruse du capitalisme ? Study. CPCP [Online]. Available at: http://www.cpcp.be/medias/pdfs/publications/economie_collaborative.pdf.
Cristianini, N., Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines and Other Kernel Based Learning Methods. Cambridge University Press, Cambridge.
Cukier, K., Mayer-Schonberger, V. (2013). Big Data: A Revolution That Will Transform How We Live, Work and Think. Houghton Mifflin Harcourt, Boston.
Curry, E. (2016). The Big Data Value Chain: Definitions, Concepts, and Theoretical Approaches. In New Horizons for a Data-Driven Economy, Cavanillas, J., Curry, E., Wahlster, W. (eds). Springer, New York.
Damesin, N. (2013). Économie de fonctionnalité. Freins et leviers à l’intégration de ce modèle économique dans les entreprises. Savoir UdeS [Online]. Available at: https://www.usherbrooke.ca/environnement/fileadmin/sites/environnementdocuments/Essais_2013/Damesin_N__2014-01-16_.pdf.
Davenport, T.H. (2014). Big data at work: Dispelling the myths, uncovering the opportunities. Harvard Business Review Press, Boston.
Davenport, T.H., Dyché, J. (2013). Big Data in big companies. Report, International Institute for Analytics [Online]. Available at: https://docs.media.bitpipe.com/io_10x/io_102267/item_725049/Big-Data-in-Big-Companies.pdf.
Davenport, T.H., Harris, J.G. (2007). Competing on Analytics: The New Science of Winning. Harvard Business Review Press, Boston.
Decrop, A. (2017). La consommation collaborative: Enjeux et défis de la nouvelle société du partage. De Boeck supérieur, Louvain-la-Neuve.
Deepak, V., Sanjay, P. (2005). Data in P2P systems [Online]. Available at: https://pdfs.semanticscholar.org/68b4/84862828bf49fe4932e421f331bcb399c05d.pdf.
Delen, D., Demirkan, H. (2013). Data, information and analytics as services. Decision Support Systems, 55(1), 359–363.
Deloitte, M. (2015). Ubérisation: Partager ou Mourir!? L’économie on-demand, ou collaborative, est un modèle disruptif qui appelle un nouveau regard sur l’innovation et sur le leadership. Présentation de travail [Online]. Available at: http://www.seratoo.com/MarketingWeb2.0/wp-content/uploads/2015/08/Etude-%C3%A9conomie-on-demand.pdf.
Demailly, D., Novel, A.-S. (2014). The Sharing Economy: Make It Sustainable. IDDRI, 3(14).
Désert, M. (2014). La consommation collaborative: une révolution citoyenne ? Pour la solidarité [Online]. Available at: http://www.pourlasolidarite.eu/sites/default/files/publications/files/2014_06_consommation_collaborative.pdf.
Ding, C.H., Nutanong, S., Buyya, R. (2005). Peer-to-Peer Networks for Content Sharing. In Peer-to-Peer Computing: The Evolution of a Disruptive Technology, Subramanian, R., Goodman, B. (eds), 28–65. IGI Global, Hershey.
Diridollou, C., Delécolle, T., Loussaeif, L., Delchet-Cochet, K. (2016). Légitimité des business models disruptifs: le cas Uber. Cairn [Online]. Available at: https://www.researchgate.net/publication/314717563Legitimitedesbusinessmodelsdisruptifs_le_cas_Uber.
Drahokoupil, J., Fabo, B. (2016). The platform economy and the disruption of the employment relationship. ETUI Policy Brief, 5, 1–6.
Drucker, P. (1994). Innovation and Entrepreneurship: Practice and Principles. Heinemann, London.
Dubreuil, V. (2017). Les dons d’entreprise. Tendances, nouveaux modèles et conseils pour optimiser vos partenariats. Association des professionnels en gestion philanthropique, Montreal [Online]. Available at: http://www.apgp.com/wp-content/uploads/2016/11/APGP_-Coaching-KCI-Dons-dentreprise.pdf.
Dufau, J.-P. (2010). L’intelligence économique. Report, Assemblée parlementaire de la francophonie [Online]. Available at: http://apf.francophonie.org/IMG/pdf/2010_ccd_rapport_intelEco.pdf.
Dupuis, F., Noreau, J. (2016). L’économie de partage: une boîte noire. Desjardins, Études économiques, 26, 1–6.
Eckhardt, G.M., Bardhi, F. (2015). The Sharing Economy Isn’t About Sharing at All. Harvard Business Review [Online]. Available at: https://hbr.org/2015/01/the-sharing-economy-isnt-about-sharing-at-all.
Einav, L., Levin, J. (2014). The Data Revolution and Economic Analysis. Innovation Policy and the Economy, 14.
Ertz, M. (2017). Quatre essais sur la consommation collaborative et les pratiques de multiples vies des objets. Administration PhD thesis, Université du Québec, Montreal [Online]. Available at: https://archipel.uqam.ca/11099/1/D3350.pdf.
Ertz, M., Durif, F., Arcand, M. (2017). An Analysis of The Origins of Collaborative Consumption and its Implications for Marketing. Academy of Marketing Studies Journal, 21(1), 1–17.
Evroux, A.-F., Jacquemin, M., De Mentque, Q., Rodet, F., Thocquenne, B. (2014). L'Économie collaborative: nouveau vecteur d'influence et de reconquête du pouvoir. Groupe ESLSCA [Online]. Available at: http://bdc.aege.fr/public/Economie_collaborative_vecteur_influence.pdf.
Fondation du Roi Baudouin (2016). L'économie collaborative, une opportunité pour les plus pauvres ? Étude exploratoire. Report [Online]. Available at: https://www.kbs-frb.be/fr/Virtual-Library/2016/20161214DD.
Foster, I., Ghani, R., Jarmin, R.S., Kreuter, F., Lane, J. (2017). Big Data and Social Science. CRC Press, Boca Raton.
Frankel, F., Reid, R. (2008). Big Data: distilling meaning from data. Nature, 455, 30.
Frenken, K., Schor, J. (2017). Putting the sharing economy into perspective. Environmental Innovation and Societal Transitions, 23, 3–10.
Frizzo-Barker, J., Chow-White, P.A., Mozafari, M., Ha, D. (2016). An empirical study of the rise of big data in business scholarship. International Journal of Information Management, 36(3), 403–413.
Ganapati, S., Reddick, C. (2018). Prospects and challenges of sharing economy for the public sector. Government Information Quarterly, 35, 77–87.
Gartner (2017). Gartner Says Worldwide IT Spending Forecast to Grow 2.7 Percent in 2017. Analysis [Online]. Available at: https://www.gartner.com/en/newsroom/press-releases/2017-01-12-gartner-says-worldwide-it-spending-forecas-to-grow-2-percent-in-2017.
Gatteschi, V., Lamberti, F., Demartini, C., Pranteda, C., Santamaría, V. (2018). Blockchain and Smart Contracts for Insurance: Is the Technology Mature Enough? Future Internet, 10(20).
Gibbs, C., Guttentag, D., Gretzel, U., Morton, J., Goodwill, A. (2018). Pricing in the sharing economy: a hedonic pricing model applied to Airbnb listings. Journal of Travel & Tourism Marketing, 35(1), 46–56.
Glaeser, E.L., Kim, H., Luca, M. (2017). Nowcasting the Local Economy: Using Yelp Data to Measure Economic Activity. NBER Working Paper 24010, Harvard Business School.
Global Investor (2015). The sharing economy: New opportunities, new questions. Investment Strategy & Research [Online]. Available at: https://www.oxfordmartin.ox.ac.uk/downloads/GI_215eGesamtPDF01high.pdf.
Godbout, J.T. (1992). L'esprit du don. La Découverte, Paris.
Gomez, P.Y., Masclef, O., Grevin, A. (2015). L'Entreprise, une affaire de don – ce que révèlent les sciences de gestion. Nouvelle Cité, Bruyères-le-Châtel [Online]. Available at: https://www.researchgate.net/publication/280932341L'Entreprise_une_Affaire_de_Don__ce_que_revelent_les_sciences_de_gestion.
Goyette-Côté, M.O. (2013). Les nouvelles formes du travail, ou comment la notion de "prosumer" permet d'analyser les pratiques participatives sur l'Internet [Online]. Available at: https://archipel.uqam.ca/5532/1/Goyette-Cote.pdf.
Grifoni, P., D'Andrea, A., Ferri, F., Guzzo, T., Angeli Felicioni, M., Praticò, C., Vignoli, A. (2018). Sharing Economy: Business Models and Regulatory Landscape in the Mediterranean Areas. International Business Research, 11(5).
Gutierrez, D. (2015). Machine Learning and Data Science: An Introduction to Statistical Learning Methods with R. Technics Publications, Basking Ridge.
Hallet, A. (2018). L'économie collaborative: définitions, impacts fiscaux et sociaux. Mémoire de recherche, Université de Liège [Online]. Available at: https://matheo.uliege.be/bitstream/2268.2/4837/4/s131668Hallet2018.pdf.
Hart, K. (2008). Karl Polanyi: prophète de la fin de l'économie libérale. Revue Interventions économiques, 38 [Online]. Available at: https://journals.openedition.org/interventionseconomiques/304.
Hazen, B.T., Boone, C.A., Ezell, J.D., Jones Farmer, L.A. (2014). Data quality for data science, predictive analytics and Big Data in supply chain management: An introduction to the problem and suggestions for research and applications. International Journal of Production Economics, 154, 72–80.
Henke, N., Bughin, J., Chui, M., Manyika, J., Saleh, T., Wiseman, B., Sethupathy, G. (2016). The age of analytics: Competing in a data-driven world. Report, McKinsey Global Institute.
Hodkinson, P. (2017). Media, Culture and Society: An Introduction. SAGE Publications, London.
Huang, Z. (1997). A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining [Online]. Available at: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1134.83&rep=rep1&type=pdf.
Institut d'assurance (2017). L'économie du partage : conséquences pour l'industrie de l'assurance au Canada. Série d'études sur les nouvelles réalités [Online]. Available at: https://www.insuranceinstitute.ca/-/media/CIP-Society/Emerging-Issues-Research-Series/Sharing-Economy/French/Sharing-Economy-report-2017-FR-WEB.pdf?la=fr&hash=D8B8BF421FCF3036D9EB18A3323E5FD8D95093E8.
Jacquet, E. (2015). Le "prêt payant": Les paradoxes de l'économie collaborative. Réseaux, 23(190-191), 99–120 [Online]. Available at: https://doi.org/10.3917/res.190.0099.
Jacquet, S. (2017). La disruption, une forme d'innovation à manager. Centre de Ressources en Économie de gestion [Online]. Available at: https://creg.ac-versailles.fr/IMG/pdf/articleladisruption.pdf.
John, N.A. (2017). The Age of Sharing. Polity, Malden.
Bower, J.L., Christensen, C.M. (1995). Disruptive Technologies: Catching the Wave. Harvard Business Review, 43–53 [Online]. Available at: http://vedpuriswar.org/articles/Disruptive%20technologies-Catching%20the%20wave.pdf.
Keen, A. (2007). The Cult of the Amateur: How Today's Internet Is Killing Our Culture. Currency, Danvers [Online]. Available at: https://filmadapter.files.wordpress.com/2014/10/andrew_keen_the_cult_of_the_amateur_how_todaysbookfi-org.pdf.
Kelly, T., Liaplina, A., Tan, S.W., Winkler, H. (2017). Reaping Digital Dividends: Leveraging the Internet for Development in Europe and Central Asia. The World Bank, Washington [Online]. Available at: https://www.worldbank.org/en/region/eca/publication/digital-dividends-in-eca.
Kestemont, B. (2015). Empreinte écologique. In Dictionnaire de la pensée écologique, Bourg, D., Papaux, A. (eds), 393–396. PUF, Paris [Online]. Available at: https://www.researchgate.net/publication/287643657Empreinteecologique/download.
Kirk, D.E. (2012). Optimal Control Theory: An Introduction. Courier Corporation, Chelmsford.
Koulouris, T. (2010). A framework for the dynamic management of Peer-to-Peer overlays. Thesis, UCL [Online]. Available at: http://discovery.ucl.ac.uk/19705/1/19705.pdf.
Lambrecht, M. (2016). L'économie des plateformes collaboratives. Courrier hebdomadaire du CRISP, 2311-2312, 5–80.
Lasida, E. (2009). Le don fondateur du lien social, le cas de l'économie de marché. Transversalités, 126, 23–35 [Online]. Available at: https://doi.org/10.3917/trans.126.0023.
Lasida, E., Lompo, K.M., Dubois, J.-L. (2009). La pauvreté: une approche socio-économique. Transversalités, 111, 35–47.
Laurent, E. (2011). Faut-il décourager le découplage? Revue de l'OFCE, Presses de Sciences Po, 235–257 [Online]. Available at: https://halshs.archives-ouvertes.fr/hal-01024198/.
Lavastre, O. (2001). Les Coûts de transaction et Olivier E. Williamson: Retour sur les fondements. In XIe Conférence de l'Association Internationale de Management Stratégique, 13–15 June, Quebec.
Lechien, R., Tinel, L. (2016). Uberisation: définition, impacts et perspectives. Travail de fin d'études, Louvain School of Management, Louvain-la-Neuve [Online]. Available at: http://www.ipdigit.eu/wp-content/uploads/2016/09/TFERenanLechienetLouisTinel.pdf.
Lemoine, L. et al. (2017). La construction de la confiance sur une plateforme de l'économie collaborative. Une étude qualitative des critères de choix d'un covoitureur sur Blablacar. Question(s) de management, 19, 77–89 [Online]. Available at: https://doi.org/10.3917/qdm.174.0077.
Lesteven, G., Godillon, S. (2017). Les plateformes numériques révolutionnent-elles la mobilité urbaine? Analyse comparée du discours médiatique de l'arrivée d'Uber à Paris et à Montréal. NETCOM, Mobilité et (r)évolution numérique, 31-3/4, 375–402 [Online]. Available at: https://journals.openedition.org/netcom/2756.
Lobel, O. (2018). Coase and the platform economy. In The Cambridge Handbook of the Law of the Sharing Economy, Davidson, N.M., Finck, M., Infranca, J.J. (eds), 67–77. Cambridge University Press, Cambridge.
MacQueen, J. (1967). Some Methods for Classification and Analysis of Multivariate Observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, Berkeley.
Maheshwari, A. (2019). Data Analytics Made Accessible. Kindle Edition.
Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., Hung Byers, A. (2011). Big data: The next frontier for innovation, competition, and productivity. Technical report, McKinsey Global Institute.
Marr, B. (2016). Big Data in Practice (Use Cases) – How 45 Successful Companies Used Big Data Analytics to Deliver Extraordinary Results. Wiley, Hoboken.
Masclef, O. (2013). Le rôle du don et du gratuit dans l'entreprise: théories et évidences. Économies et Sociétés, 22, 7–31.
Massé, D., Borel, S., Demailly, D. (2015). Comprendre l'économie collaborative et ses promesses à travers ses fondements théoriques. IDDRI, 5(15) [Online]. Available at: https://www.iddri.org/sites/default/files/import/publications/wp0515pico_fondementstheoriques.pdf.
Mauss, M. (1923). Essai sur le don: Forme et raison de l'échange dans les sociétés archaïques. L'Année sociologique.
McAfee, A., Brynjolfsson, E. (2011). Race Against the Machine: How the Digital Revolution is Accelerating Innovation, Driving Productivity, and Irreversibly Transforming Employment and the Economy. Digital Frontier Press, Lexington.
McKelvey, M., Zhu, Y. (2013). Business models in Big Data in China: Opportunities through sequencing and bioinformatics. In How Entrepreneurs Do What They Do: Case Studies of Knowledge Intensive Entrepreneurship, McKelvey, M., Lassen, A.H. (eds). Edward Elgar Publishers, Cheltenham.
Michelini, L., Principato, L., Iasevoli, G. (2018). Understanding food sharing models to tackle sustainability challenges. Ecological Economics, 145, 205–217.
Milojicic, D.S., Kalogeraki, V., Lukose, R., Nagaraja, K. (2001). Peer-to-Peer Computing [Online]. Available at: https://www.hpl.hp.com/techreports/2002/HPL-2002-57R1.pdf.
Moore, G.E. (1965). Cramming More Components onto Integrated Circuits. Electronics, 38(8), 114–117.
Morabito, V. (2015). Big Data and Analytics: Strategic and Organizational Impacts. Springer International Publishing, New York.
Nicklaus, D. (2017). Produire plus avec moins de matières: pourquoi ? Ministère de la Transition écologique et solidaire [Online]. Available at: https://www.ecologique-solidaire.gouv.fr/sites/default/files/Th%C3%A9ma%20%20Produire%20plus%20avec%20moins%20de%20mati%C3%A8res.pdf.
Nicot, A.-M. (2017). Le modèle économique des plateformes: économie collaborative ou réorganisation des chaînes de valeur? La Revue des Conditions de Travail, 6, 1–10.
NVP (2017). Big Data Executive Survey 2017. Report, NewVantage Partners [Online]. Available at: https://newvantage.com/wp-content/uploads/2017/01/Big-Data-Executive-Survey-2017-Executive-Summary.pdf.
Nyce, C. (2007). Predictive analytics white paper. Report, American Institute for CPCU, Insurance Institute of America, 9–10.
O'Toole, J., Matherne, B. (2017). Uber: Aggressive management for growth. The Case Journal, 13, 561–586 [Online]. Available at: https://www.researchgate.net/publication/314196711_Uber_Aggressive_management_for_growth.
OECD (2004). Indicateurs clés d'environnement de l'OCDE. Report [Online]. Available at: https://www.oecd.org/fr/env/indicateurs-modelisation-perspectives/31558903.pdf.
OECD (2016). Protecting Consumers in Peer Platform Markets: Exploring the Issues. 2016 Ministerial Meeting on the Digital Economy, Background Report. OECD Digital Economy Papers no. 253 [Online]. Available at: https://unctad.org/meetings/en/Contribution/dtl-eWeek2017c05-oecd_en.pdf.
Ohlhorst, F. (2013). Big Data Analytics: Turning Big Data into Big Money. John Wiley & Sons, Hoboken.
Owyang, J. (2015). Large Companies Ramp up Adoption in the Collaborative Economy. Study, Collaborative Economy [Online]. Available at: http://www.web-strategist.com/blog/2015/07/20/large-companies-ramp-up-adoption-in-the-collaborative-economy/.
Owyang, J. (2016). Honeycomb 3.0: The collaborative economy market expansion. Study, Collaborative Economy [Online]. Available at: http://www.web-strategist.com/blog/2016/03/10/honeycomb-3-0-the-collaborative-economy-market-expansion-sxsw/.
Pascal, C. (2002). La Nouvelle Sociologie économique et le lien marchand: des relations personnelles à l'impersonnalité des relations. Revue française de sociologie, 43(3), 521–556 [Online]. Available at: https://www.jstor.org/stable/3322598.
Pasquier, V., Daudigeos, T. (2016). L'économie collaborative, ce n'est vraiment pas le partage pour tous. Slate [Online]. Available at: http://www.slate.fr/story/119763/promesse-economie-partage-fausse.
Pekarskaya, M. (2015). Sharing Economy and Socio-Economic Transitions: An Application of the Multi-Level Perspective on a Case Study of Carpooling in the USA (1970-2010). Lund University, Lund [Online]. Available at: http://lup.lub.lu.se/student-papers/record/7869049.
Penard, T. (2014). Stratégies et modèles d'affaires des plateformes: principes et applications. Presentation, Université de Rennes 1 [Online]. Available at: https://innovation-regulation.telecom-paristech.fr/wp-content/uploads/2017/10/Seminaire2SMPenard.pdf?lang=en.
Penn, L. (2016). La consommation collaborative au cœur de la plateforme Couchsurfing. Thesis, Université du Québec, Montreal [Online]. Available at: https://archipel.uqam.ca/9440/1/M14642.pdf.
Peters, A. (2009). Radical collaboration. World Changing website [Online]. Available at: http://www.thesweden.se/files/RC_FS_2013_eng_web.pdf.
Poirier, Y. (2014). Économie sociale solidaire et concepts apparentés. Les origines et les définitions: une perspective internationale [Online]. Available at: http://www.ripess.org/wp-content/uploads/2017/09/%C3%89conomie-solidaire-et-autres-concepts-PoirierJuillet-2014.pdf.
Prahalad, C.K., Ramaswamy, V. (2004). Co-creation Experiences: The Next Practice in Value Creation. Journal of Interactive Marketing, 18(3), 5–14.
Preston, F.A. (2012). Global Redesign? Shaping the Circular Economy. The Royal Institute of International Affairs, London.
PwC (2015). The Sharing Economy [Online]. Available at: https://www.pwc.fr/fr/assets/files/pdf/2015/05/pwc_etude_sharing_economy.pdf.
PwC (2016). Économie collaborative: prévision de 83 milliards d'euros de chiffre d'affaires en Europe d'ici 2025 [Online]. Available at: https://www.pwc.fr/fr/espace-presse/communiques-de-presse/2016/septembre/economie-collaborative-prevision-de-83-milliards-deuros-ca.html.
Quinlan, J.R. (1986). Induction of Decision Trees. Machine Learning, 1(1), 81–106.
Quinlan, J.R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, Burlington.
Raies, A. (2012). La destruction créatrice contribue-t-elle à la croissance de l'efficience sectorielle? Une revue critique de la littérature théorique et empirique. Revue Valaque d'Études Économiques, 3(17) [Online]. Available at: https://www.researchgate.net/publication/309528361_la_destruction_creatrice_contribue-t-elle_a_la_croissance_de_l'efficience_sectorielle_une_revue_critique_de_la_litterature_theorique_et_empirique.
Raimbault, N., Vétois, P. (2017). L'"uberisation" de la logistique: disruption ou continuité ? Le cas de l'Île-de-France [Online]. Available at: https://www.openscience.fr/IMG/pdf/iste_issn17v3n3.pdf.
Ranjbari, M., Morales-Alonso, G., Carrasco-Gallego, R. (2018). Conceptualizing the Sharing Economy through Presenting a Comprehensive Framework. Sustainability, 10 [Online]. Available at: https://ideas.repec.org/a/gam/jsusta/v10y2018i7p2336-d156460.html.
Ranzini, G., Etter, M., Lutz, C., Vermeulen, I. (2017). Privacy in the Sharing Economy. Report [Online]. Available at: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2960942.
Rebaud, A.-L. (2016). Économie circulaire et emploi: enjeux et perspectives. Notes d'analyse, Pour la solidarité [Online]. Available at: http://www.pourlasolidarite.eu/sites/default/files/publications/files/na-2016-emplois-eco-circulaire.pdf.
Reisch, L. (2008). Nature et culture de la consommation dans les sociétés de consommation. L'Économie politique, 39, 42–49.
Ries, E. (2011). The Lean Startup: How Today's Entrepreneurs Use Continuous Innovation to Create Radically Successful Businesses. Crown Publishing Group, New York.
Rifkin, J. (2014). The Zero Marginal Cost Society: The Internet of Things, the Collaborative Commons, and the Eclipse of Capitalism. Palgrave Macmillan, New York.
Ritzer, G., Jurgenson, N. (2010). Production, Consumption, Prosumption. Journal of Consumer Culture, 10(1), 13–36 [Online]. Available at: https://doi.org/10.1177%2F1469540509354673.
Robert, I., Binninger, A.S., Ourahmoune, N. (2014). La consommation collaborative, le versant encore équivoque de l'économie de la fonctionnalité. Développement durable et territoires, 5(1).
Rogers, S. (2011). Big Data is Scaling BI and Analytics. Information Management Magazine.
Rolland-Piégue, E. (2011). La responsabilité sociale des entreprises au Japon, de l'époque d'Edo à la norme ISO 26 000 et à l'accident nucléaire de Fukushima. Réalités industrielles, 2, 33–37.
Roose, K. (2014). The Sharing Economy Isn't About Trust, It's About Desperation. New York Magazine [Online]. Available at: http://nymag.com/daily/intelligencer/2014/04/sharing-economy-is-about-desperation.html.
Samuel, A.L. (1959). Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development, 3(3), 210–229.
SAS (2017). Predictive Analytics: What it is and why it matters [Online]. Available at: https://www.sas.com/en_us/insights/analytics/predictive-analytics.html.
Saussereau, L. (2012). Vers l'économie du don. Les Echos.fr [Online]. Available at: http://archives.lesechos.fr/archives/cercle/2012/03/28/cercle_45060.htm.
Sauvé, S., Normandin, D., McDonald, M. (2016). L'économie circulaire. Les Presses de l'Université de Montréal, Montreal.
Schaefer, D. (2014). Débrider l'innovation: enjeux pour les entreprises et l'emploi, défi pour les politiques publiques. Report, CCI Paris Île-de-France [Online]. Available at: https://www.cci.fr/documents/10988/8024289/CCIPdebrider-innovation.pdf.
Schollmeier, R. (2001). A Definition of Peer-to-Peer Networking for the Classification of Peer-to-Peer Architectures and Applications. In Proceedings of the First International Conference on Peer-to-Peer Computing (P2P'01) [Online]. Available at: https://www.researchgate.net/publication/3940901ADefinitionofPeer-to-PeerNetworking_for_the_Classification_of_PeertoPeer_Architectures_and_Applications.
Schor, J.B. (2011). True Wealth: How and Why Millions of Americans Are Creating a Time-rich, Ecologically Light, Small-scale, High-satisfaction Economy. The Penguin Press, New York.
Schor, J.B. (2014). Debating the Sharing Economy. Great Transition Initiative [Online]. Available at: https://www.greattransition.org/images/GTIpublications/Schor_Debating_the_Sharing_Economy.pdf.
Schor, J.B. et al. (2014). Paradoxes of Openness and Distinction in the Sharing Economy. Poetics, 54, 66–81.
Sedkaoui, S. (2018a). Data Analytics and Big Data. ISTE Ltd, London and Wiley, New York.
Sedkaoui, S. (2018b). Big Data Analytics for Entrepreneurial Success: Emerging Research and Opportunities. IGI Global, New York.
Sedkaoui, S., Gottinger, H.W. (2017). The Internet, Data Analytics and Big Data. In Internet Economics: Models, Mechanisms and Management, Gottinger, H.W. (ed.), 144–166. eBook Bentham Science Publishers, Sharjah.
Sedkaoui, S., Khelfaoui, M. (2019). Understand, develop and enhance the learning process with big data. Information Discovery and Delivery, 47(1), 2–16.
Sedkaoui, S., Monino, J.L. (2016). Big Data, Open Data and Data Development. ISTE Ltd, London and Wiley, New York.
Shmueli, G., Koppius, O.R. (2011). Predictive analytics in information systems research. MIS Quarterly, 35(3), 553–572.
Sidoli, Y. (2017). L'usage en partage: Analyse comparative des modèles socio-économiques d'"économie de (la) fonctionnalité" et d'"économie collaborative". PhD thesis, Université Côte d'Azur [Online]. Available at: https://tel.archives-ouvertes.fr/tel-01516611/document.
Simon, H. (1977). The New Science of Management Decision. Prentice Hall, Englewood Cliffs.
Simon, H. (1997). Administrative Behavior: A Study of Decision-Making Processes in Administrative Organizations. Free Press, New York.
Smiechowski, B.C. (2016). Data as a common in the sharing economy: a general policy proposal. CEPN working paper no. 2016-10.
Soldatos, J. (2017). Building Blocks for IoT Analytics: Internet-of-Things Analytics. River Publishers Series in Signal, Image and Speech Processing, Gistrup.
Srnicek, N. (2017). Platform Capitalism. Polity Press, Cambridge.
Stalder, F., Sützl, W. (2011). Ethics of Sharing. International Review of Information Ethics, 15.
Stahel, W. (2006). The Performance Economy. Palgrave Macmillan, Basingstoke.
Stanoevska-Slabeva, K., Lenz-Kesekamp, V., Suter, V. (2016). Platforms and the Sharing Economy. Report [Online]. Available at: https://www.bi.edu/globalassets/forskning/h2020/ps2share_platformanalysispaper_final.pdf.
Steiner, P. (2005). Le marché selon la sociologie économique. Revue européenne des sciences sociales [Online]. Available at: http://ress.revues.org/326.
Steiner, P. (2012). Marché, transaction et liens sociaux: l'approche de la sociologie économique. Revista de Sociologia e Política, 20(42), 111–119.
Stemler, A. (2017). The myth of the sharing economy and its implications for regulating innovation. Emory Law Journal, 67(197), 197–241.
Supiot, A. (1993). Le travail, liberté partagée. Droit social, 9, 715–725.
Sutherland, W., Jarrahi, M.H. (2018). The Sharing Economy and Digital Platforms: A Review and Research Agenda. International Journal of Information Management, 43, 328–341.
Swedberg, R., Smelser, N.J. (1994). The Handbook of Economic Sociology. Princeton University Press [Online]. Available at: https://www.researchgate.net/publication/235362932The_Handbook_of_Economic_Sociology.
Tapscott, D., Williams, A.D. (2006). Wikinomics: How Mass Collaboration Changes Everything. Portfolio, Penguin Group Inc., New York.
Terrasse, P. (2016). Rapport au Premier Ministre sur l'économie collaborative. Report [Online]. Available at: https://www.gouvernement.fr/sites/default/files/document/document/2016/02/08.02.2016rapportaupremierministresurleconomiecollaborative.pdf.
Torfs, A. (2016a). Quelles perspectives d'évolution par rapport aux acteurs conventionnels? Étude de cas – Airbnb. Research Thesis, Louvain School of Management, Louvain-la-Neuve [Online]. Available at: https://dial.uclouvain.be/memoire/ucl/en/object/thesis%3A7228/datastream/PDF_01/view.
Torfs, A. (2016b). Économie collaborative – Modèle économique alternatif et disruptif. Study, Louvain School of Management, Louvain-la-Neuve.
Utz, S., Kerkhof, P., Van den Bos, J. (2012). Consumers rule: How consumer reviews influence perceived trustworthiness of online stores. Electronic Commerce Research and Applications, 11(1), 49–58.
Vaileanu-Paun, I., Boutillier, S. (2012). Économie de la fonctionnalité: une nouvelle synergie entre le territoire, la firme et le consommateur? Innovations, 37, 95–125.
Van de Walle, I., Hebel, P., Siounandan, N. (2012). Cahier de recherche – Les secondes vies des objets: les pratiques d'acquisition et de délaissement des produits de consommation. Report no. 290, CREDOC.
Van Niel, J. (2007). L'économie de fonctionnalité: définition et état de l'art [Online]. Available at: http://economiedefonctionnalite.fr/wp-content/uploads/2010/04/definition_et_etat_de_lart-Johan-Van-Niel.pdf.
Van Niel, J. (2014). L'économie de fonctionnalité: principes, éléments de terminologie et proposition de typologie. Développement durable et territoires, 5(1) [Online]. Available at: http://developpementdurable.revues.org/10160.
Verdier, H., Colin, N. (2015). L'Âge de la multitude: Entreprendre et comprendre après la révolution numérique, 2nd edition. Armand Colin, Paris.
Vermillion, S. (2015). Three PR leadership lessons from Dale Carnegie [Online]. Available at: https://stephanievermillion.com/2015/04/15/three-pr-leadership-lessons-from-dale-carnegie/.
Verne, C.D., Meier, O. (2013). Culture et éthique. Regard sur le Japon et les entreprises japonaises. VA Press, Paris [Online]. Available at: https://www.vapress.fr/attachment/476983/.
Waller, M.A., Fawcett, S.E. (2013). Data science, predictive analytics, and big data: A revolution that will transform supply chain design and management. Journal of Business Logistics, 34, 77–84.
Wasko, M.M., Faraj, S. (2005). Why should I share? Examining social capital and knowledge contribution in electronic networks of practice. MIS Quarterly, 29(1), 35–57.
Watt, J., Borhani, R., Katsaggelos, A. (2016). Machine Learning Refined: Foundations, Algorithms, and Applications. Cambridge University Press, Cambridge.
Williams, C.C., Windebank, J. (2005). Refiguring the nature of undeclared work: Some evidence from England. European Societies, 7(1), 81–102 [Online]. Available at: https://doi.org/10.1080/1461669042000327036.
Winkle, J., Reitsma, R., Fleming, G., Duan, X., Collins, C., Birrell, R. (2018). Millennials Drive the Sharing Economy [Online]. Available at: https://www.forrester.com/report/Millennials+Drive+The+Sharing+Economy/-/E-RES141974.
Winkler, H. (2017). À qui profite l'économie du partage en Europe? La Banque mondiale [Online]. Available at: https://blogs.worldbank.org/voices/fr/qui-profite-l-economie-du-partage-en-europe.
Witten, I.H., Frank, E., Hall, M.A., Pal, C.J. (2016). Data Mining: Practical Machine Learning Tools and Techniques. Elsevier, Amsterdam.
Wu, L., Brynjolfsson, E. (2015). The Future of Prediction: How Google Searches Foreshadow Housing Prices and Sales. In Economic Analysis of the Digital Economy, Goldfarb, A., Greenstein, S.M., Tucker, C.E. (eds). University of Chicago Press, Chicago.
Yang, S. (2018). Analytics pipeline at Lyft [Online]. Available at: https://cdn.oreillystatic.com/en/assets/1/event/269/Lyft_s%20analytics%20pipeline_%20From%20Redshift%20to%20Apache%20Hive%20and%20Presto%20Presentation.pdf.
Yaraghi, N., Ravi, S. (2017). The Current and Future State of the Sharing Economy. Brookings Institution India Center.
Zahadat, N., Partridge, W. (2018). Blockchain: A Critical Component to Ensuring Data Security. Journal of Forensic Sciences & Criminal Investigation, 10(1) [Online]. Available at: https://juniperpublishers.com/jfsci/pdf/JFSCI.MS.ID.555780.pdf.
Index
A
ADEME, 18, 22, 23
Airbnb
  in Paris, 162, 163, 169
  listings.csv, 153, 185
algorithm, 9, 84, 103, 106, 123, 125, 133, 137, 143–145, 148, 152, 168, 172–177, 179–183, 185–188, 190, 192, 193, 195–204, 206, 207, 209–213
  classification, 171–173, 175, 176, 179, 182, 183, 185, 193, 194, 197, 211
  regression, 177
  supervised, 123, 140
  unsupervised, 123, 193
Alibaba, 42, 77, 138
Amazon
  Echo, 104
  Web Services (AWS), 104, 197
analysis
  data, 71, 76, 82–84, 86, 88–90, 92, 93, 97, 98, 100, 101, 103–106, 116, 117, 122–126, 128–130, 133, 135, 137–140, 143, 144, 146, 153, 154, 160, 166, 169, 171–173, 190, 193, 202, 213
  descriptive, 83, 84, 86, 87, 95, 98
  exploratory, 94, 95, 97, 153, 159, 160, 166, 184
  predictive, 83–85, 87, 98, 132
  prescriptive, 83, 85, 86, 87, 88, 98
  sentiment, 85, 125, 128, 176
Apple, 20, 128
Artificial Intelligence (AI), 70, 104, 130, 135
awareness, 3
  ecological, 3

B
benefit, 23, 36, 88, 108
Big Data
  3V's, 64, 65, 68, 70, 71, 78, 83, 101, 105, 125
  analytics
    Business Intelligence (BI), 75, 89
    data science, 84
  applications, 103, 130, 131, 133, 139
  databases, 67, 71, 72, 76, 82, 84, 85, 91–93, 110, 125, 126, 128, 132, 136, 152
  technologies
    Apache Spark, 127–129
    Hadoop, 71, 125–128, 134, 140
    MapReduce, 71, 126–129, 140
    NoSQL, 127, 128
Sharing Economy and Big Data Analytics, First Edition. Soraya Sedkaoui and Mounia Khelfaoui. © ISTE Ltd 2020. Published by ISTE Ltd and John Wiley & Sons, Inc.
Bitcoin, 53, 54, 110
BlaBlaCar, 58, 77, 99, 100, 107–112, 116, 120, 121, 130, 133, 134, 140, 150, 169
  A/B testing, 134
  carpooling, 8, 51, 58, 100, 120, 121, 134, 138
bootstrap, see also bagging (classified under "branching point"), 181
branching point
  bagging, see also bootstrap, 181
  entropy, 174, 175, 187
  information gain, 175
  node, 174, 185
business models, 29, 36, 40, 41, 59, 75, 77–79, 100, 106–108, 115, 120, 139
  data-driven business model (DDBM), 78
  new economic models, 56, 59
  sharing economy, 100
business transactions, 25, 33, 34
  B2B, 32, 33, 109
  C2B, 32, 33
  C2C, 32–34, 109
  C2G, 32, 33
C
capacity
  calculation, 69–73
  real-time, 45, 64, 68, 69, 71, 72, 82, 83, 88, 91, 102, 106, 110, 114, 128–133, 138, 176
  storage
    cloud computing, 74, 106, 132
    Moore's law, 73
classification
  attribute, see also entity, 174, 175, 176, 177, 214
  class, 12, 94, 123, 171, 174–177, 180, 181, 183, 185, 194, 195, 198, 199, 202, 203, 212
  hierarchical, 195, 196, 200, 201, 205–208, 214
    agglomerative, 206
    complete linkage, 206, 207
    dendrogram, 207, 208
    divisive, 206
    single linkage, 206, 207
    Ward criterion, 207, 208
  label, 172, 174, 176, 185, 195
  similarity, 123, 190, 195, 198–201
clustering, 95, 103, 195–202, 204, 205, 208, 213
  centroid, 201–203, 205
  cluster, 134, 195, 196, 200–207, 209–212
  cluster inertia, 204
  number of clusters, 203–207, 210–213
  similarity measure, 198, 206
Coase, 56
collaborative consumption, 6–8, 34
consumer society, 7
cooperation, 17, 18, 22, 23, 130, 25, 52, 108, 109
cooperativism, 111
  cooperative, 4, 5
correlation, 82, 94, 95, 101, 102, 175, 187, 215, 143, 153, 160, 161, 163, 169
  Pearson correlation coefficient, 161
crisis, 3, 8, 19, 20, 24, 27, 29, 108
  economic, 3, 8, 20, 27
D
data
  qualitative, 65
  quality, 69, 76, 91, 115
  quantitative, 65
  quantity, 73, 115, 205
  semi-structured, 70, 110, 111
  structured, 92, 127, 133, 171
  unstructured, 83, 92, 109, 128
  science (classified under "Big Data Analytics"), 82, 84
  security, 111, 113, 114, 117, 129, 138
  different forms, 63, 70
  visualization, 95, 130
data-sets, 69, 70, 72, 78, 82, 85, 105, 114, 115, 119, 121, 127, 132, 150, 168, 171, 172, 193, 194, 195, 200, 202, 205, 208, 213
DataFrame, 209
decoupling, 20, 21
  absolute, 21
  relative, 21
distance, 148, 168, 178–180, 190, 191, 198–200, 202–204, 206, 208, 211, 214
  Euclidean, 190, 191, 199, 204, 208
  Manhattan, 199
E
e-commerce, 102, 103, 126
ecology, 9, 20, 24, 26
economic
  activity, 3, 15, 17, 19, 20, 24, 59
  growth, 21, 26
  sociology, 13, 15, 16
economy
  sharing
    collaborative, 3, 4, 6, 8, 30, 32–37, 40, 45, 48, 49, 51, 54, 57, 59
    platform, 6, 9
  service economy, 18–23
entity, see also attribute (classified under "classification"), 175, 176, 178, 198, 201
errors, 93, 94, 147, 204
ethics, 13, 25, 112
externalities, 3, 19, 21, 36
  negative, 19, 21
  positive, 36

F
Facebook, 32, 42, 53, 57, 64, 66, 68, 73, 100–105, 127, 128
  cookies, 105
Flink, 129, 134, 135

G
Gartner, 64, 70
gift, 8, 10, 13, 14, 15, 16, 17, 18, 23
  Mauss, 13, 14
goods, 4, 5, 7, 8, 14, 18–20, 22, 23, 27, 29, 30, 32, 33, 42, 47, 51–59, 85, 100, 108, 109, 111–113, 116, 120, 122, 145
Google, 32, 57, 64–66, 77, 101, 103, 126, 127
Gradient Descent, 147

H, I
HBase, 71, 127, 129
Heetch, 58
hidden layers, 182
Hive, 127, 128, 134, 135
honeycomb, 34
  3.0, 35
IBM, 105, 106, 128, 129
  Watson, 106
images, see also photos, 66, 95, 104, 105, 124, 133, 135, 181, 196
Industrial Revolution, 4, 7, 39
industrialization, 19
information, 7, 10, 48, 53, 54, 57, 58, 70, 75, 95, 175, 197, 202, 210
  knowledge, 31
innovation, 22, 26, 40, 42–47, 52, 119, 130
  disruptive, 43, 44, 46, 47
Instagram, 53, 66, 73
Internet of Things (IoT), 70, 82, 110 connected objects, 66, 68–70, 72, 73, 105, 108, 112, 135 GPS, 69, 135 smart watches, 69, 73 smartphones, 3, 41, 51, 69, 73, 105, 132 tablets, 3, 45, 51, 69
J, K Java, 128 k-means, 195, 196, 200–205, 209–214 elbow, 210–212 elbow criterion, 210, 211 k clusters, 203 k-Nearest Neighbors (kNN), 172, 179, 190, 193 Kafka, 134, 135 knowledge discovery in databases (KDD), 89
L Lasso, 149, 151, 152, 167, 168, 170, 184, 187, 189 LaZooz, 138 blockchain, 110, 129, 130 linear, 144–146, 152, 154, 167, 168, 169, 177, 179, 184, 187, 205 Lyft, 99, 100, 107, 111, 116, 134, 135, 139
M, N Machine Learning, 70, 84, 86, 98, 104, 121–125, 128, 129, 131, 135, 138, 139, 140, 143, 147, 166, 172, 193, 196 algorithms supervised, 123, 140 unsupervised, 123, 193
Mean Squared Error (MSE), 148, 149, 154, 168, 189 Michelin, 20 Microsoft, 20, 129 Mobike, 138 modeling, 8, 71, 75, 89, 92, 93, 95, 96, 98, 103, 126, 144, 146, 151–154, 158, 160, 166, 167, 181, 185, 210, 211 model’s performance, 92, 153, 167, 185 predictive model, 84, 86, 96, 153, 169, 193 multicollinearity, 149–151, 160, 167, 170 Naïve Bayes, 124, 172, 175, 176, 183, 193, 194 Bayes’ theorem, 175, 176 conditional independence, 176 conditional probability, 176 natural environment, 50 Natural Language Processing (NLP), 106 Netflix, 53, 57, 64, 66, 77, 100–103, 197 House of Cards, 102, 103 networks computer, 10, 13 neural, 124, 172, 179, 181, 182, 193, 194 Deep Learning, 181 social, 7, 10, 52, 53, 76, 91, 102, 103, 104, 110, 134 NumPy, 153, 185, 209
O Ofo, 138 Open Data, 91, 114, 116 open source, 127 ordinary least squares (OLS), 147 overexploitation of resources, 21
P
pandas, 153, 160, 185, 209 parameters, 95, 147, 148, 154, 167, 168, 181, 185, 208, 209 robust, 97, 124, 168 PayPal, 55, 57 peer-to-peer (P2P), 6, 10–13, 23, 46, 55, 109, 125 Napster, 11 performance economic, 11 environmental, 22 petabytes, 69, 72, 102, 110 photos see also images, 66–68, 91, 105, 111, 112, 117, 135 platforms collaborative, 30, 31, 57 digital, 9, 10, 20, 29, 30, 39, 43, 45, 51–55, 57–60, 77, 99, 100, 106–109 poverty, 29, 30, 31, 32 process data analysis, 86, 89, 97, 98, 103, 124, 126, 135 data collection, 91, 98 data preparation, 94, 97, 98, 153, 154, 158, 169 deployment, 96–98, 125, 138, 169 decision-making, 64, 69, 74, 75, 76, 89, 101, 102, 116, 133 Simon’s model, 76 value creation, 75–77, 100, 110, 116 Product Service System (PSS), 18, 22 PwC, 26, 120 Python, 71, 125, 126, 128, 129, 144, 152, 153, 158, 185, 208, 209, 213 libraries, 153, 160, 185, 209
Q, R
Quora, 99, 137 R, 71, 125, 128, 129 Random Forest, 124, 169, 172, 179, 180, 181, 184, 185, 187–190, 193, 194 regression linear, 143, 144, 146–154, 159, 167–169, 171, 184, 190 logistic, 149, 150, 152, 170, 172, 177 function, 150, 151 Ridge, 149, 151, 152, 167, 168, 184 sigmoid function, 150 relations economic, 17, 54 social, 12, 14, 15, 16, 25, 59 Return on Investment (ROI), 77, 87 Root Mean Squared Error (RMSE), 168, 184, 187, 189, 190, 192, 193
S Scikit Learn, 209 sense of satisfaction, 31 sentiment analysis, 85, 125, 128, 137 server, 10, 12, 28 sharing economy companies, 52, 100, 107, 110, 111, 114, 121, 122, 125, 126, 127, 130, 131, 134, 138, 139, 193, 194, 196, 197, 200, 215 social expectations, 58 socio-economics, 8, 10, 26, 27 software, 6, 76, 110, 140, 170, 194 start-up, 36, 41, 42, 45, 50, 58, 78, 107, 120, 126, 134 Support Vector Machine (SVM), 124, 172, 177–179, 183, 193, 194 gap, 177, 178, 179 support vectors, 178, 179
surge pricing (classified under “Uber”), 132 sustainability, 27, 37, 49, 50
T TaskRabbit, 99, 137 Taskers, 137 technology digital, 5, 44, 56 mobile, 3, 6, 7, 130 transaction costs, 9, 55, 56, 115 trends, 136 trust, 9, 22, 23, 26, 30, 49, 50, 55, 108, 109, 115, 130 Twitter, 53, 57, 66, 102, 103
U, V Uber, 39–44, 46, 47, 51, 58, 66, 77, 99, 100, 107–112, 116, 120, 121, 124, 130–134, 139, 140, 150, 169 Movement, 132 surge pricing, 132 Ubereats, 51 uberization, 39–41, 44–46, 50 value, 8, 14, 18–23, 28, 37, 41, 47, 52, 53, 55, 57, 58, 65, 69–72, 75–77, 82, 89, 91–93, 97, 100, 101, 107, 110, 114, 115, 117, 119–123, 130, 132, 133, 140, 144–146, 148, 150, 152, 154, 159, 161, 171–174, 178, 180, 182, 185, 187, 189, 190, 192, 193, 200, 202, 204, 211, 215 human, 13, 16 missing, 92, 156, 158, 184
outlier, 91, 93 social, 17, 58 variable dependent, 143–146, 149, 150, 154, 158, 162, 167 independent, 143–146, 149, 150, 158, 160, 166, 167, 184 vector, 81, 178, 179, 198, 199, 200, 202 videos, 9, 10, 11, 57, 64, 66–68, 73, 91, 104, 110, 181, 197 virtual, 15, 52–55, 104, 115
W Walmart, 101, 102, 103 waste, 47 reduction, 49 web 1.0, 10 2.0, 10 well-being, 17, 19, 20, 21, 24, 37, 40 social, 24, 37, 40 WhatsApp, 53, 57 Wikinomics, 28
X, Y, Z Xerox, 20, 44, 46 YARN, 128 Yelp, 135, 136, 137 text analytics, 137 Yelp.com, 136 yottaoctets, 69 YouTube, 28, 53, 57, 64, 66, 73, 197 zettaoctets, 69, 72, 110
Other titles from ISTE in Information Systems, Web and Pervasive Computing
2019 ALBAN Daniel, EYNAUD Philippe, MALAURENT Julien, RICHET Jean-Loup, VITARI Claudio Information Systems Management: Governance, Urbanization and Alignment AUGEY Dominique, with the collaboration of ALCARAZ Marina Digital Information Ecosystems: Smart Press BATTON-HUBERT Mireille, DESJARDIN Eric, PINET François Geographic Data Imperfection 1: From Theory to Applications BRIQUET-DUHAZÉ Sophie, TURCOTTE Catherine From Reading-Writing Research to Practice BROCHARD Luigi, KAMATH Vinod, CORBALAN Julita, HOLLAND Scott, MITTELBACH Walter, OTT Michael Energy-Efficient Computing and Data Centers CHAMOUX Jean-Pierre The Digital Era 2: Political Economy Revisited COCHARD Gérard-Michel Introduction to Stochastic Processes and Simulation
DUONG Véronique SEO Management: Methods and Techniques to Achieve Success GAUCHEREL Cédric, GOUYON Pierre-Henri, DESSALLES Jean-Louis Information, The Hidden Side of Life GEORGE Éric Digitalization of Society and Socio-political Issues 1 GHLALA Riadh Analytic SQL in SQL Server 2014/2016 JANIER Mathilde, SAINT-DIZIER Patrick Argument Mining: Linguistic Foundations SOURIS Marc Epidemiology and Geography: Principles, Methods and Tools of Spatial Analysis TOUNSI Wiem Cyber-Vigilance and Digital Trust: Cyber Security in the Era of Cloud Computing and IoT
2018 ARDUIN Pierre-Emmanuel Insider Threats (Advances in Information Systems Set – Volume 10) CARMÈS Maryse Digital Organizations Manufacturing: Scripts, Performativity and Semiopolitics (Intellectual Technologies Set – Volume 5) CARRÉ Dominique, VIDAL Geneviève Hyperconnectivity: Economical, Social and Environmental Challenges (Computing and Connected Society Set – Volume 3) CHAMOUX Jean-Pierre The Digital Era 1: Big Data Stakes
DOUAY Nicolas Urban Planning in the Digital Age (Intellectual Technologies Set – Volume 6) FABRE Renaud, BENSOUSSAN Alain The Digital Factory for Knowledge: Production and Validation of Scientific Results GAUDIN Thierry, LACROIX Dominique, MAUREL Marie-Christine, POMEROL Jean-Charles Life Sciences, Information Sciences GAYARD Laurent Darknet: Geopolitics and Uses (Computing and Connected Society Set – Volume 2) IAFRATE Fernando Artificial Intelligence and Big Data: The Birth of a New Intelligence (Advances in Information Systems Set – Volume 8) LE DEUFF Olivier Digital Humanities: History and Development (Intellectual Technologies Set – Volume 4) MANDRAN Nadine Traceable Human Experiment Design Research: Theoretical Model and Practical Guide (Advances in Information Systems Set – Volume 9) PIVERT Olivier NoSQL Data Models: Trends and Challenges ROCHET Claude Smart Cities: Reality or Fiction SAUVAGNARGUES Sophie Decision-making in Crisis Situations: Research and Innovation for Optimal Training SEDKAOUI Soraya Data Analytics and Big Data
SZONIECKY Samuel Ecosystems Knowledge: Modeling and Analysis Method for Information and Communication (Digital Tools and Uses Set – Volume 6)
2017 BOUHAÏ Nasreddine, SALEH Imad Internet of Things: Evolutions and Innovations (Digital Tools and Uses Set – Volume 4) DUONG Véronique Baidu SEO: Challenges and Intricacies of Marketing in China LESAS Anne-Marie, MIRANDA Serge The Art and Science of NFC Programming (Intellectual Technologies Set – Volume 3) LIEM André Prospective Ergonomics (Human-Machine Interaction Set – Volume 4) MARSAULT Xavier Eco-generative Design for Early Stages of Architecture (Architecture and Computer Science Set – Volume 1) REYES-GARCIA Everardo The Image-Interface: Graphical Supports for Visual Information (Digital Tools and Uses Set – Volume 3) REYES-GARCIA Everardo, BOUHAÏ Nasreddine Designing Interactive Hypermedia Systems (Digital Tools and Uses Set – Volume 2) SAÏD Karim, BAHRI KORBI Fadia Asymmetric Alliances and Information Systems: Issues and Prospects (Advances in Information Systems Set – Volume 7)
SZONIECKY Samuel, BOUHAÏ Nasreddine Collective Intelligence and Digital Archives: Towards Knowledge Ecosystems (Digital Tools and Uses Set – Volume 1)
2016 BEN CHOUIKHA Mona Organizational Design for Knowledge Management BERTOLO David Interactions on Digital Tablets in the Context of 3D Geometry Learning (Human-Machine Interaction Set – Volume 2) BOUVARD Patricia, SUZANNE Hervé Collective Intelligence Development in Business EL FALLAH SEGHROUCHNI Amal, ISHIKAWA Fuyuki, HÉRAULT Laurent, TOKUDA Hideyuki Enablers for Smart Cities FABRE Renaud, in collaboration with MESSERSCHMIDT-MARIET Quentin, HOLVOET Margot New Challenges for Knowledge GAUDIELLO Ilaria, ZIBETTI Elisabetta Learning Robotics, with Robotics, by Robotics (Human-Machine Interaction Set – Volume 3) HENROTIN Joseph The Art of War in the Network Age (Intellectual Technologies Set – Volume 1) KITAJIMA Munéo Memory and Action Selection in Human–Machine Interaction (Human–Machine Interaction Set – Volume 1) LAGRAÑA Fernando E-mail and Behavioral Changes: Uses and Misuses of Electronic Communications
LEIGNEL Jean-Louis, UNGARO Thierry, STAAR Adrien Digital Transformation (Advances in Information Systems Set – Volume 6) NOYER Jean-Max Transformation of Collective Intelligences (Intellectual Technologies Set – Volume 2) VENTRE Daniel Information Warfare – 2nd edition VITALIS André The Uncertain Digital Revolution (Computing and Connected Society Set – Volume 1)
2015 ARDUIN Pierre-Emmanuel, GRUNDSTEIN Michel, ROSENTHAL-SABROUX Camille Information and Knowledge System (Advances in Information Systems Set – Volume 2) BÉRANGER Jérôme Medical Information Systems Ethics BRONNER Gérald Belief and Misbelief Asymmetry on the Internet IAFRATE Fernando From Big Data to Smart Data (Advances in Information Systems Set – Volume 1) KRICHEN Saoussen, BEN JOUIDA Sihem Supply Chain Management and its Applications in Computer Science NEGRE Elsa Information and Recommender Systems (Advances in Information Systems Set – Volume 4) POMEROL Jean-Charles, EPELBOIN Yves, THOURY Claire MOOCs
SALLES Maryse Decision-Making and the Information System (Advances in Information Systems Set – Volume 3) SAMARA Tarek ERP and Information Systems: Integration or Disintegration (Advances in Information Systems Set – Volume 5)
2014 DINET Jérôme Information Retrieval in Digital Environments HÉNO Raphaële, CHANDELIER Laure 3D Modeling of Buildings: Outstanding Sites KEMBELLEC Gérald, CHARTRON Ghislaine, SALEH Imad Recommender Systems MATHIAN Hélène, SANDERS Lena Spatio-temporal Approaches: Geographic Objects and Change Process PLANTIN Jean-Christophe Participatory Mapping VENTRE Daniel Chinese Cybersecurity and Defense
2013 BERNIK Igor Cybercrime and Cyberwarfare CAPET Philippe, DELAVALLADE Thomas Information Evaluation LEBRATY Jean-Fabrice, LOBRE-LEBRATY Katia Crowdsourcing: One Step Beyond SALLABERRY Christian Geographical Information Retrieval in Textual Corpora
2012 BUCHER Bénédicte, LE BER Florence Innovative Software Development in GIS GAUSSIER Eric, YVON François Textual Information Access STOCKINGER Peter Audiovisual Archives: Digital Text and Discourse Analysis VENTRE Daniel Cyber Conflict
2011 BANOS Arnaud, THÉVENIN Thomas Geographical Information and Urban Transport Systems DAUPHINÉ André Fractal Geography LEMBERGER Pirmin, MOREL Mederic Managing Complexity of Information Systems STOCKINGER Peter Introduction to Audiovisual Archives STOCKINGER Peter Digital Audiovisual Archives VENTRE Daniel Cyberwar and Information Warfare
2010 BONNET Pierre Enterprise Data Governance BRUNET Roger Sustainable Geography
CARREGA Pierre Geographical Information and Climatology CAUVIN Colette, ESCOBAR Francisco, SERRADJ Aziz Thematic Cartography – 3-volume series Thematic Cartography and Transformations – Volume 1 Cartography and the Impact of the Quantitative Revolution – Volume 2 New Approaches in Thematic Cartography – Volume 3 LANGLOIS Patrice Simulation of Complex Systems in GIS MATHIS Philippe Graphs and Networks – 2nd edition THÉRIAULT Marius, DES ROSIERS François Modeling Urban Dynamics
2009 BONNET Pierre, DETAVERNIER Jean-Michel, VAUQUIER Dominique Sustainable IT Architecture: the Progressive Way of Overhauling Information Systems with SOA PAPY Fabrice Information Science RIVARD François, ABOU HARB Georges, MERET Philippe The Transverse Information System ROCHE Stéphane, CARON Claude Organizational Facets of GIS
2008 BRUGNOT Gérard Spatial Management of Risks FINKE Gerd Operations Research and Networks
GUERMOND Yves Modeling Process in Geography KANEVSKI Michael Advanced Mapping of Environmental Data MANOUVRIER Bernard, LAURENT Ménard Application Integration: EAI, B2B, BPM and SOA PAPY Fabrice Digital Libraries
2007 DOBESCH Hartwig, DUMOLARD Pierre, DYRAS Izabela Spatial Interpolation for Climate Data SANDERS Lena Models in Spatial Analysis
2006 CLIQUET Gérard Geomarketing CORNIOU Jean-Pierre Looking Back and Going Forward in IT DEVILLERS Rodolphe, JEANSOULIN Robert Fundamentals of Spatial Data Quality