HANDBOOK OF BIG DATA RESEARCH METHODS
RESEARCH HANDBOOKS IN INFORMATION SYSTEMS This new and exciting series brings together authoritative and thought-provoking contributions on the most pressing topics and issues in Information Systems. Handbooks in the series feature specially commissioned chapters from eminent academics, each overseen by an Editor internationally recognized as a leading name within the field. Chapters within the Handbooks feature comprehensive and cutting-edge research, and are written with a global readership in mind. Equally useful as reference tools or high-level introductions to specific topics, issues, methods and debates, these Research Handbooks will be an essential resource for academic researchers and postgraduate students.
Handbook of Big Data Research Methods Edited by
Shahriar Akter Faculty of Business and Law, University of Wollongong, Australia
Samuel Fosso Wamba Department of Information, Operations and Management Sciences, TBS Business School, France
RESEARCH HANDBOOKS IN INFORMATION SYSTEMS
Cheltenham, UK • Northampton, MA, USA
© Editors and Contributors Severally 2023
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical or photocopying, recording, or otherwise without the prior permission of the publisher. Published by Edward Elgar Publishing Limited The Lypiatts 15 Lansdown Road Cheltenham Glos GL50 2JA UK Edward Elgar Publishing, Inc. William Pratt House 9 Dewey Court Northampton Massachusetts 01060 USA A catalogue record for this book is available from the British Library Library of Congress Control Number: 2023901410 This book is available electronically in the Business subject collection http://dx.doi.org/10.4337/9781800888555
ISBN 978 1 80088 854 8 (cased) ISBN 978 1 80088 855 5 (eBook)
Contents
List of contributors   vii
1   Introduction to the Handbook of Big Data Research Methods   1
    Shahriar Akter, Samuel Fosso Wamba, Shahriar Sajib and Sahadat Hossain
2   Big data research methods in financial prediction   11
    Md Lutfur Rahman and Shah Miah
3   Big data, data analytics and artificial intelligence in accounting: an overview   32
    Sudipta Bose, Sajal Kumar Dey and Swadip Bhattacharjee
4   The benefits of marketing analytics and challenges   52
    Madiha Farooqui
5   How big data analytics will transform the future of fashion retailing   72
    Niloofar Ahmadzadeh Kandi
6   Descriptive analytics and data visualization in e-commerce   86
    P.S. Varsha and Anjan Karan
7   Application of big data Bayesian interrupted time-series modeling for intervention analysis   105
    Neha Chaudhuri and Kevin Carillo
8   How predictive analytics can empower your decision making   117
    Nadia Nazir Awan
9   Gaussian process classification for psychophysical detection tasks in multiple populations (wide big data) using transfer learning   128
    Hossana Twinomurinzi and Herman C. Myburgh
10  Predictive analytics for machine learning and deep learning   148
    Tahajjat Begum
11  Building a successful data science ecosystem using public cloud   165
    Mohammad Mahmudul Haque
12  How HR analytics can leverage big data to minimise employees' exploitation and promote their welfare for sustainable competitive advantage   179
    Kumar Biswas, Sneh Bhardwaj and Sawlat Zaman
13  Embracing Data-Driven Analytics (DDA) in human resource management to measure the organization performance   195
    P.S. Varsha and S. Nithya Shree
14  A process framework for big data research: social network analysis using design science   214
    Denis Dennehy, Samrat Gupta and John Oredo
15  Notre-Dame de Paris cathedral is burning: let's turn to Twitter   233
    Serge Nyawa, Dieudonné Tchuente and Samuel Fosso Wamba
16  Does personal data protection matter in data protection law? A transformational model to fit in the digital era   267
    Gowri Harinath
17  Understanding the future trends and innovations of AI-based CRM systems   279
    Khadija Alnofeli, Shahriar Akter and Venkata Yanamandram
18  Descriptive analytics methods in big data: a systematic literature review   295
    Nilupulee Liyanagamage and Mario Fernando
Index   309
Contributors
Shahriar Akter is a Professor of Digital Marketing, Analytics & Innovation at the University of Wollongong, Australia. Shahriar was awarded his PhD from the UNSW Business School, with a fellowship in research methods from the Oxford Internet Institute, University of Oxford. Shahriar has published in top-ranked business journals. His research areas include big data analytics, digital and social media marketing, digital innovations, service systems and variance-based SEM techniques.

Khadija Alnofeli is a Doctoral Researcher at the University of Wollongong, Australia. Currently, her doctoral research is centred on examining AI-based CRM capabilities. Khadija presented her paper, titled "The Future of AI-Based CRM", at the British Academy of Management Conference (2022). Prior to commencing her doctoral studies, she worked at a leading aviation company, where she adeptly managed both corporate and online sales operations.

Nadia Nazir Awan holds a Master's degree in Business Analytics. Nadia presented a conference paper at the Kuala Lumpur International Business, Economics and Law Conference in 2017 and published a research paper on "Intentions to become an entrepreneur". She is currently studying Computer Science and her research interests cover areas including machine learning, Big Data, and artificial intelligence.

Tahajjat Begum has a Master's degree in Digital Innovation, majoring in Data Science, from Dalhousie University in Halifax, Canada. Her publication "Deep learning models for gesture-controlled drone operation" was published in 2020 at the 16th International Conference on Network and Services. She also won the Statistics Canada National Data Scientist Competition in 2020. She aims to focus her research on Big Data and Data Science in future.

Sneh Bhardwaj is a research scholar at the University of Melbourne and a sessional lecturer at Federation University, Australia. Sneh has published in reputed journals such as Equality, Diversity and Inclusion and Evidence-based HRM. She has extensive engagement in industry as a trainer/consultant across MNCs such as Nestlé, Danone, Unilever, and Coke.

Swadip Bhattacharjee is a PhD candidate at the University of Wollongong, Australia. He also works as an Associate Professor at the University of Chittagong in Bangladesh (on leave). His research interests include environmental behaviour, intellectual capital, corporate governance, and political connections.

Kumar Biswas is a lecturer at the University of Wollongong, Australia, with a PhD from the University of Newcastle, Australia. He has published in leading journals including the Journal of Business Research, Journal of Strategic Management, and International Journal of Human Resource Management. Kumar's research interests lie in Big Data and the dynamics of Artificial Intelligence.
Sudipta Bose is a Senior Lecturer in Accounting in the Discipline of Accounting and Finance at the University of Newcastle, Australia. His research interests include the capital market, cost of equity capital, carbon emissions and assurance, integrated reporting, corporate governance, big data, and machine learning. Dr Bose has published his scholarly articles in several A*/A category journals.

Kevin Carillo is an Associate Professor in Data Science & Information Systems at Toulouse Business School (France). He is the director of the Master of Science in Big Data, Marketing & Management. Kevin holds a PhD degree in Information Systems from the School of Information Management of Victoria University of Wellington, New Zealand. His current research interests include artificial intelligence, big data and data-driven business, free/open-source software communities, online communities, and peer production.

Neha Chaudhuri is Assistant Professor in Information Management at Toulouse Business School, France. She completed her PhD in Management Information Systems at the Indian Institute of Management Calcutta, India. Her research interests are in deep learning, AI for business, and online misinformation management. Her research has been published in journals including Decision Support Systems, in Proceedings of IEEE, and in IS conferences such as HICSS and ICIS.

Denis Dennehy is Associate Professor at the School of Management, Swansea University, Wales. He has published in premier journals on topics related to the mediating role of technologies and its implications for people, organizations, and society. Prior to his current position, he was affiliated with the University of Galway, Ireland, and obtained his PhD at University College Cork, Ireland.

Sajal Kumar Dey is an Assistant Professor in Accounting at Jagannath University, Bangladesh. He was awarded his PhD in Accounting from the University of Newcastle, Australia. His research interests include the capital market, climate change disclosures, corporate governance, and big data.

Madiha Farooqui holds a Master of Business Analytics degree from the University of Wollongong, Australia. She was lauded for conducting a successful seminar on "Role of Leaders and Technology" in 2016. To further excel in the IT field, she is currently pursuing a Master of Computer Science. Her aspiration is to become a researcher in the field of Big Data and Artificial Intelligence.

Mario Fernando is Professor of Management and the Director of the Centre for Cross-Cultural Management at the University of Wollongong, Australia. His research and teaching interests are centered on exploring how responsible managerial action leads to positive social change. He has published in European Journal of Marketing, International Journal of Consumer Studies, Journal of Retailing and Consumer Services, Human Relations, Journal of Business Ethics, Public Administration and Development and Electronic Markets. Mario has also published three books.

Samuel Fosso Wamba is the Associate Dean of Research at TBS Education, France. His current research focuses on the business value of information technology, blockchain, artificial intelligence for business, business analytics, and big data. He is among the 2% of the most influential scholars globally based on the Mendeley database. He ranks in Clarivate's 1% most cited scholars worldwide for 2020–2021 and in CDO Magazine's Leading Academic Data Leaders 2021.

Samrat Gupta is an Assistant Professor in Information Systems at the Indian Institute of Management, Ahmedabad, India. He has published in premier journals on network analysis, online platforms, and social media. His research has been awarded by several organizations including IIM Bangalore, IDRBT and the University of Chile. Before starting his academic career, Samrat worked as a software engineer in the insurance and healthcare sector.

Mohammad Mahmudul Haque has been working for Oracle for the last 11+ years. Currently, he holds the position of Principal Cloud Architect in the Oracle CAPAC Services Singapore Branch office. He enjoys working with organizations to help them strategize how to leverage the latest technologies in cloud data platforms to achieve their business goals.

Gowri Harinath is a research aspirant. Gowri graduated with a Master of Business Law from Monash University, Australia, and a Bachelor of Commerce (Accounting and Taxation) from Bangalore University, India. Her research interests lie in privacy and data protection in Artificial Intelligence, integrating policies, society and technology. Gowri currently works in the Due Diligence and Compliance Department in the Health and Medical sector at a Private Health Provider in Sydney, Australia.

Sahadat Hossain is a PhD candidate at the University of Wollongong, Australia. He also works as a Senior Lecturer at the University of Liberal Arts, Bangladesh (on leave). He has worked as a Research Assistant with university and development organizations. He is particularly interested in conducting research on the intersection between organizations' strategy and behavior.

Niloofar Ahmadzadeh Kandi is a PhD student of Business at the University of Wollongong, Australia. Niloofar holds an MBA with a business analytics specialization. Her research interest focuses on Big Data, Machine Learning and the relationship between the business world and advanced AI technology. Her current research evaluates ethical machine learning offerings in the field of marketing.

Anjan Karan is a Marketing Professional at Startek company in Bangalore, India. His current role focuses on rebooting customer engagement in operational processes on digital platforms to build business resilience. He holds an MBA degree from the Cambridge Institute of Technology, affiliated with Visvesvaraya Technological University.

Nilupulee Liyanagamage is a Visiting Research Associate at the University of Wollongong, Australia, and a Lecturer at Notre Dame University, Australia. She completed her PhD in Machiavellian Leadership in the School of Business at the University of Wollongong.

Shah Miah is a Professor and Head of Business Analytics at the University of Newcastle, Australia. Since receiving his PhD in Business DSS, his research interests have expanded in the subfields of Business Analytics. He has produced over 200 publications, including in top-tier outlets: Journal of the Association of Information Systems, Journal of the Association for Information Science and Technology, and Information and Management.

Herman C. Myburgh is the head of the Advanced Sensor Networks (ASN) research group in the Department of Electrical, Electronic, and Computer Engineering at the University of Pretoria, South Africa. His current research interests are in wireless communication systems, sensor fusion, machine learning, and mobile health. He is the inventor of a number of smartphone-based hearing assessment solutions, and he is a co-founder and scientific advisor to a South African digital health company, hearX Group (Pty) Ltd.

Serge Nyawa is currently Assistant Professor at TBS Education, France and co-coordinator of the Master of Science in Artificial Intelligence and Business Analytics. His research includes machine learning, deep learning, financial systemic risk, and the estimation and forecasting of high-dimensional covolatility matrices for portfolio allocation, risk management and asset pricing. His work has been published in several journals, including Journal of Econometrics and Annals of Operations Research.

John Oredo is a lecturer in the Departments of Information Science and Management Science at the University of Nairobi, Kenya. He holds a PhD in Business Administration (Strategic Information Systems) and industry certifications in project management and big data. John has published in and reviewed for leading Information Systems journals.

Md Lutfur Rahman is a Senior Lecturer (Finance) at the University of Newcastle, Australia. His research interests revolve around climate change and energy finance, corporate governance, corporate social responsibility, financial contagion, and asset pricing. He has published in leading finance and economics journals including Energy Economics, International Review of Financial Analysis, and International Review of Economics and Finance.

Shahriar Sajib completed his PhD in Strategic Management at the University of Technology Sydney, Australia. He has co-authored journal articles related to emerging technologies in reputed journals. He has worked as a research assistant on funded industry research projects. Currently, he teaches for the UK-based London Graduate School.

S. Nithya Shree is an HR professional at Million Talents Private Limited, a talent acquisition and HR solutions company based in Bangalore, India. As an HR specialist, she selects the best talent for clients, offering HR services in the gig economy to establish long-term partnerships with clients and employees.

Dieudonné Tchuente is an Assistant Professor in the Information Management department at TBS Education, France. His research interests include applied machine learning, big data analytics, intelligent transportation systems, user modelling in information systems, and social network analysis. His work has been published in several journals including Decision Support Systems, Annals of Operations Research, and Social Network Analysis and Mining.

Hossana Twinomurinzi, PhD (IT), is a 4IR Professor at the University of Johannesburg, South Africa. He is currently the Head of the Centre for Applied Data Science, which seeks to infuse data science into business and economics research. His primary research interests are in Applied Data Science, Digital Skills, Digital Government, Digital Innovation and Digital Development.

P.S. Varsha is an Assistant Professor at the School of Commerce, Management & Economics, Presidency University, Bangalore, India. Varsha has a PhD from Visvesvaraya Technological University (VTU), a Master's degree with Marketing and HR specialization, and a Bachelor's degree in Computer Science. Her current research is in AI and its applications, marketing analytics, HR analytics and social network analysis.

Venkata Yanamandram is an Associate Professor of Marketing at the University of Wollongong, Australia. Venkat's peer-reviewed journal articles appear in numerous publications, including Journal of Business Research, Annals of Tourism Research, Journal of Travel and Tourism Marketing, Current Issues in Tourism, Journal of Retailing and Consumer Services, and International Journal of Service Industry Management. Venkat has executed various teaching-related, discipline-specific, and cross-faculty research projects, supported by internal and external grant funding.

Sawlat Zaman is a lecturer at the University of Newcastle, UK, with a doctorate from Cardiff University, UK. Sawlat has published papers in leading journals such as Thunderbird International Business Review, and book chapters with Palgrave and Edward Elgar Publishing. Her research interests encapsulate global HRM and employment, and AI.
1. Introduction to the Handbook of Big Data Research Methods
Shahriar Akter, Samuel Fosso Wamba, Shahriar Sajib and Sahadat Hossain
INTRODUCTION

The debate concerning big data analytics (BDA)-related opportunities and challenges is gaining increasing attention from academics and practitioners (Grover et al., 2018; Frizzo-Barker et al., 2016). The rapidly developing and expanding analytics movement is closely associated with organizational opportunities such as business innovation and the development of data products (Davenport and Kudyba, 2016); cognitive computing and business intelligence (Gupta et al., 2018); prediction of customers' behavior and sentiment analysis (Ragini et al., 2018; Grover et al., 2018; Shirazi and Mohammadi, 2019); organizational responsiveness and agility (Popovič et al., 2018); and co-creation of knowledge (Acharya et al., 2018). However, robust knowledge of how to leverage BDA effectively across service systems such as healthcare, transport, financial systems, supply chain and logistics management, and retail is still scarce (Maglio and Lim, 2016). Contributing more than 70 percent of global GDP, the service sector has significant potential to enhance the learning, adaptability and decision-making capability of present service systems by adopting a data-driven approach to produce superior performance under environmental uncertainty (Medina-Borja, 2015). The application of BDA for service system innovation and superior decision making is evident in the extant literature (Opresnik and Taisch, 2015; Ghasemaghaei et al., 2017). For example, using a BDA-driven recommendation engine that exploits real-time information, product and rate comparison data, Amazon increased its sales revenue by over 30 percent; Marriott achieved an 8 percent increase in revenue through revenue optimization; and Capital One increased its retention rate by 87 percent (Davenport and Harris, 2017). Although small and medium sized enterprises are keen to adopt BDA to replicate the success achieved by large enterprises, it is critical to integrate BDA effectively with organizational decision making to produce superior performance outcomes (de Vasconcelos and Rocha, 2019; Côrte-Real et al., 2017). Further, Dwivedi et al. (2017) highlight the rapid progress of big open linked data (BOLD), creating novel scope for innovation to solve problems and improve quality and performance. In response to the growing importance of explicating the process of producing superior performance through BDA-driven decision making and operational activities, this chapter aims to explicate the processes involved in BDA research methodology within the context of the present competitive landscape.

Scholars have reported difficulties experienced by firms in yielding value from BDA programs (Power, 2016; Miah et al., 2017; Kaisler et al., 2013). Ransbotham et al. (2016) note a declining number of companies achieving competitive advantage through BDA in recent times. In particular, a significant gap exists in the extant literature regarding the effective operationalization of BDA within a context of practical decision-making scenarios
or business problems (Miah et al., 2017; Elgendy and Elragal, 2016). To fill this critical gap, this chapter attempts to answer the key research question: 'what are the steps in the BDA research method in the present business context?' Reviewing the extant literature, this chapter outlines a general framework of the BDA research method to help enhance the understanding of data-driven decision making and problem solving. The proposed framework contains six steps, and the essential functions of each step are elaborated incorporating examples from present business operations. The remainder of this chapter is organized in two main sections. First, the six steps of the BDA research method are presented. Then we offer a discussion reflecting on the significance of these steps concerning the present challenges related to BDA research and the future research agenda.
STEPS OF THE BIG DATA RESEARCH METHOD

Step 1: Problem Definition

Constructing a well-articulated problem definition is the first step of the BDA research method. Recognizing the problem, or the decision that needs to be taken, is the initial step of the data-driven decision-making process. Appropriate framing of the business problem involves problem recognition, appreciating the significance of the problem and determining the expected outcomes of the big data research. Next, Davenport and Kim (2013) recommend specifying the problem by building an understanding of the approach needed to tackle it and identifying the stakeholders concerned. Finally, Lilien and Rangaswamy (2006) highlight that framing the problem so that it can lead to choice models can result in more effective models, higher quality insights and consistent model-based decisions. Davenport and Kim (2013) emphasize the importance of framing the problem with utmost clarity to identify the business problem precisely, and warn that having access to big data or a sophisticated analytic tool will not be useful if the problem is poorly identified.

The extant literature offers rich evidence of the importance of the problem definition phase in big data research. For example, Chauhan et al. (2022) articulate a business problem faced by a local government service unit in fulfilling non-emergency service requests, owing to the variable time required to organize service providers. The authors developed a predictive model that utilizes historical data to uncover patterns in the types of customer service requests, resulting in increased readiness and reduced waiting times for customers. Tan et al. (2016) present a case study on how Trustev, a global digital verification technology provider, applied BDA to tackle the issue of identity fraud in e-commerce. Further, Miah et al. (2017) demonstrate that by applying BDA to social media data, tourism firms can perform future and seasonal demand forecasts for superior destination management. These findings confirm that defining the problem is the first step in the BDA research method, specifying the business problem along with its underlying elements.

Step 2: Review of Previous Findings and Context

The next step involves rigorous research to reveal the relevant previous findings and insights about the context. Exploration of past findings relevant to the defined problem is necessary
to frame the problem more precisely (Salehan and Kim, 2016; Davenport, 2013). Further, comprehensive research of prior findings is critical for organizations and researchers to avoid the potential drawbacks of existing measures (Akter et al., 2019). For example, reviewing previously applied methods, including the anticipatory shipping model used by Amazon, Lee (2017) explicates that the shipping model is used to predict a consumer's purchase decision so that shipping can commence prior to placement of the order. Davenport and Kim (2013) suggest that reviewing past research aids revision of the problem, leading to superior framing that reflects the exploration of different aspects of the context, such as the availability of resources or the presence of potential constraints within the broader socio-economic and legal environment. Salehan and Kim (2016) identified the importance of extracting meaning from both the title and the text of customers' reviews by applying BDA techniques, rather than using the numeric star rating and word count, to predict review performance. Further, through collaboration with Alibaba, Haier enhanced its capabilities to address issues in e-commerce by applying BDA (Sun et al., 2015). Overall, thorough investigation of past findings and context directs further refinement, or more precise framing, of the problem to overcome potential limitations.

Step 3: Select the Variables and Develop the Model

In the third step, the key activities are identifying the variables necessary to construct the model and developing the model as a simplified representation of the problem (Akter et al., 2019). To solve the defined problem effectively through modeling, it is critical to formulate hypotheses that appropriately represent the associations between different variables and their impact on the outcome (Davenport and Kim, 2013). Lee (2017) developed a predictive model utilizing variables such as traveling time and transportation cost to determine the allocation of products across different distribution centers to optimize the forecasted delivery time. Although an expansive approach is beneficial at the problem identification stage, the objective of step 3 is to produce a precise problem statement and a model that consists of the variables of the study (Akter et al., 2019). Complex models are capable of tackling big data as well as its direct and indirect effects, which is not possible through traditional econometric and statistical models (Wedel and Kannan, 2016). For example, Xiang et al. (2015) constructed a model to estimate the association between guest experience and satisfaction utilizing online consumer reviews in the hospitality industry. Although big data offers opportunities for deeper insight and superior decisions by leveraging the distinctive characteristics of volume, variety and variability, gaining access to the necessary requirements of complex modeling may present a serious challenge (Akter and Wamba, 2016; Sivarajah et al., 2017). In conclusion, by completing the modeling stage, BDA researchers and decision makers should obtain a narrower focus on the identified problem, with knowledge of the necessary variables and the relationships that may require measurement.

Step 4: Collecting the Data and Testing the Model

Collecting all necessary and relevant data and testing the model is the next step in the BDA research method (Janssen et al., 2017; Davenport, 2013).
Today’s companies have access to a large volume of data that originates from diverse sources including transactions, interactions
or click stream and video data (Akter and Wamba (2016) offer a review). The data obtained from these sources can be categorized broadly as structured, semi-structured, and unstructured (Gandomi and Haider, 2015). Structured data is relatively easy to capture, organize and perform queries on, and can originate from sources such as operational systems (e.g., transaction data), internal reporting systems and machine data retrieved from automated systems (Phillips-Wren et al., 2015). Structured data comprises four key dimensions: attributes, variables, subjects and time (Naik et al., 2008). Because of their identifiable features, semi-structured data, which lack a robust structure, are becoming a popular data source in BDA (Phillips-Wren et al., 2015). Unstructured data (e.g., images, wikis, blogs) are generally ill-defined and highly variable, yet greatly in demand at present. To assist decision makers to make better decisions, it is critical to identify data sources that have strategic significance in relation to the focal problem (Phillips-Wren et al., 2015). Although the ultimate goal of obtaining mass volumes of data is to support decision making and create value, only 0.5 percent of collected data has been reported to be analyzed (Bumblauskas et al., 2017). Lack of clarity and understanding about the background of the data, and the inherently internal nature of data collection and processing, can constrain the production of actionable insights from the data (Sivarajah et al., 2017; Chen and Zhang, 2014). For analytical purposes companies use primary data sources such as behavioral, historical and transactional data (Chiang and Yang, 2018). For example, He et al. (2017) provide a detailed explanation of how companies exploit the Twitter data of their competitors for strategic planning and competitive positioning. Further, Guo et al. (2017) develop a cost-effective method to monitor the market position of firms in real time by applying BDA to publicly available online content. Technological developments such as cloud storage and social media have also provided access to massive amounts of publicly accessible secondary data. These examples emphasize the importance of choosing potential data sources with the usefulness and relevance of the data to the focal problem and model in mind.

Step 5: Analyze the Data

Effective exploitation of big data requires appropriate tools and approaches (Al Nuaimi et al., 2015). Adoption of advanced analytics makes a significant contribution to enhancing the quality of insights obtained from the data, as well as the decisions that can be made based on big data. The analytical method carries importance across the different stages of BDA research; however, uncovering relationships that are embedded in the data is facilitated by the selection and adoption of an appropriate analytical method (Davenport and Kim, 2013; Phillips-Wren et al., 2015). In the initial phase of data analysis, data mining, or extracting and cleansing, is performed on data having properties such as diversity, divergence and interrelatedness (Chen et al., 2013). Next, aggregation and integration of data are performed in order to apply appropriate BDA techniques for data analysis (Sivarajah et al., 2017).
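To make the cleansing and aggregation described above concrete, the following minimal sketch (not drawn from the chapter; the records, column names and grouping choices are hypothetical) shows one way transaction data might be prepared with pandas before any analytical technique is applied.

```python
import pandas as pd

# Hypothetical transaction records; in practice these would come from an
# operational system, e.g. pd.read_csv("transactions.csv").
raw = pd.DataFrame({
    "customer_id": [101, 101, 102, 103, 103, None],
    "amount": [25.0, 40.0, None, 15.0, 15.0, 30.0],
    "channel": ["web", "web", "store", "web", "web", "store"],
    "timestamp": pd.to_datetime(["2023-01-02", "2023-01-05", "2023-01-05",
                                 "2023-01-07", "2023-01-07", "2023-01-09"]),
})

# Cleansing: drop records with missing keys or values, remove exact duplicates.
clean = raw.dropna(subset=["customer_id", "amount"]).drop_duplicates()

# Aggregation and integration: one row per customer, ready for modelling.
features = clean.groupby("customer_id").agg(
    total_spend=("amount", "sum"),
    n_transactions=("amount", "count"),
    last_purchase=("timestamp", "max"),
)
print(features)
```

A customer-level table of this kind would then feed the modelling and analysis activities discussed in the remaining steps.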
BDA techniques for data analysis are grouped under three broad categories: descriptive, predictive and prescriptive (Wang et al., 2016; Delen and Demirkan, 2013), while categories such as pre-emptive analytics, inquisitive and diagnostic analytics are also suggested by scholars (Sivarajah et al., 2017; Wedel and Kannan, 2016). Descriptive analytics reflects the past and the current state of the problem and applies tools and statistical methods that are useful in creating periodic business reports, on-demand or ad-hoc reports providing insights on novel opportunities or problems (Delen and Demirkan, 2013). For example, He et al. (2017) utilized
sentiment analysis, text mining and comparative analytics to analyze social media data in order to assess competitors. Further, to optimize anticipatory shipping, Lee (2017) applied predictive and anticipatory analytics using a combination of a genetic algorithm (GA) and cluster-based association rule mining. Predictive analytics is used to predict or forecast associations or trends following the inherent relationships among the variables (Waller and Fawcett, 2013). Predictive analytics is useful for anticipating the future possibilities of an event and can explicate potential causes (Delen and Demirkan, 2013). Prescriptive analytics is crucial for data-driven decision making and comprises analytical techniques that enable finding the optimal course of action or decision, considering multiple alternatives under different constraints (Phillips-Wren et al., 2015). Further, prescriptive modeling can produce rich information and expert opinions (Watson, 2014), and positive outcomes such as service enhancement and cost reduction can be achieved through randomized testing and optimization (Joseph and Johnson, 2013). Scholars have suggested further analytical categories that are to some extent variations of the above, such as diagnostic analytics, used to test hypotheses and estimate relationships between variables (Wedel and Kannan, 2016); inquisitive analytics, used to accept or reject business propositions; and pre-emptive analytics, used to enable precautionary actions to be taken to prevent undesirable influences. The extant literature provides analytic techniques such as sentiment analysis (Phillips-Wren and Hoskisson, 2015), algorithm analysis and machine learning (Lee, 2017; Tan et al., 2016), and regression analysis (Loukis et al., 2012). Overall, there are several analytical methods available for big data analysis. However, it is essential for BDA research to identify relevant methods that suit the problem at hand and to develop human resources that can perform such advanced analytics.

Step 6: Actions on Insights

The final step of the BDA research method involves taking actions on the defined problem, reflecting the insights obtained from BDA. To gain plausible insights it is critical to interpret the data comprehensively following the data analysis. Interpretation can be facilitated through visualization with enhanced understandability, for the purpose of extracting meaning and knowledge (Kokina et al., 2017). For example, Singh et al. (2017) developed machine learning models that help predict the helpfulness of consumer reviews using textual features such as polarity, subjectivity or reading ease. Moreover, Saboo et al. (2016) identified that companies can increase their revenue by more than 17 percent by reallocating their marketing resources based on a time-varying effects model. These examples showcase how interpretations of BDA can produce actionable insights for decision makers. Zhong et al. (2016) noted that the purpose of BDA is to untangle implicit information and knowledge from big data to facilitate higher quality decisions that gain acceptance. Communication plays a vital role in gaining acceptance of BDA-driven decisions (Sutanto et al., 2008). Sharma et al. (2014) recommend communicating the results using a story-telling approach to ensure understandability across stakeholders, while Davenport and Kim (2013) suggest that adopting a story-telling approach can be effective for stakeholders with a limited background in analytics.
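The Singh et al. (2017) models themselves are not reproduced here, but the following minimal sketch, using simulated reviews and simplified stand-ins for polarity, subjectivity and reading ease, illustrates the kind of predictive model this step describes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)
n = 500

# Hypothetical textual features for n consumer reviews:
# polarity in [-1, 1], subjectivity in [0, 1], reading ease in [0, 100].
X = np.column_stack([
    rng.uniform(-1, 1, n),     # polarity
    rng.uniform(0, 1, n),      # subjectivity
    rng.uniform(0, 100, n),    # reading ease
])

# Synthetic "helpful" label: easier-to-read, less subjective reviews are
# assumed more likely to be voted helpful (an illustrative rule only).
logits = 0.03 * X[:, 2] - 1.5 * X[:, 1] - 1.0
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logits))).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))
```

In a real study the features would be computed from the review text itself rather than simulated, but the modelling logic, fitting a classifier on textual features and evaluating it on held-out reviews, is the same.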
The steps involved in the BDA-driven decision-making process draw on both technology and technique, such as analysis and interpretation, and necessitate human intervention to produce insights that are meaningful and actionable (Jagadish et al., 2014). Therefore, while analyzing data using data analytics tools to reveal novel insights, Sharma et
al. (2014) emphasize nurturing an active engagement process between the BDA analyst and the business managers. Finally, a better outcome following a BDA research method requires actions taken by the decision makers reflecting the derived insights and knowledge.
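As a small illustration of the visualization and story-telling point above, the sketch below plots hypothetical effect sizes from an earlier modelling step as a simple chart that non-specialist stakeholders can read; the numbers and labels are illustrative only.

```python
import matplotlib.pyplot as plt

# Hypothetical effect sizes produced by an earlier modelling step.
drivers = {"reading ease": 0.31, "review length": 0.12,
           "polarity": 0.05, "subjectivity": -0.22}

fig, ax = plt.subplots(figsize=(6, 3))
ax.barh(list(drivers), list(drivers.values()))
ax.set_xlabel("estimated effect on review helpfulness")
ax.set_title("Which review features drive helpfulness?")
fig.tight_layout()
fig.savefig("helpfulness_drivers.png")  # a figure that can be shared with stakeholders
```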
DISCUSSION

The general framework involving the six steps of the BDA research method offers a useful guideline for utilizing BDA in the decision-making process. Although the stepwise process is depicted as linear and sequential, each step can enhance other steps, which gives decision makers the opportunity to revisit earlier steps in light of new learning. For example, during the analysis stage, analysts may choose to derive more knowledge by collecting more data, or may reveal new problems that result in revision of the entire model. Under these circumstances, the BDA research method is more effective than the traditional structured decision-making procedure (Elgendy and Elragal, 2016). However, the BDA research method presents challenges, such as those scholars have identified in big data sources, data processing and management, and big data itself (Janssen et al., 2017; Zicari, 2014; Sivarajah et al., 2017). These challenges have a significant impact on how firms embrace BDA in their decision-making process. Value extraction through BDA is a multi-step process; therefore, Jagadish et al. (2014) warn that a narrow focus on a few steps may prevent value creation from being achieved.

Davenport (2013) and Davenport and Kim (2013) applied a conceptual framework in articulating the six steps that are presented in this study. The authors consider the problem recognition and review of previous findings as a 'problem framing' analytical stage. The following three steps of modeling, data collection and data analysis comprise the 'solving the problem' stage. The final step – acting on insights – represents the analytical stage of 'communicating and acting on results'. The problem framing or identification stage is widely discussed in the literature investigating decision-making and design science process models. For example, Simon (1977) suggested that the first phase of the decision-making model focuses on formulating and identifying the problem along with the surrounding context and conditions. Similarly, Peffers et al. (2007) identified that 'problem identification and motivation' is critical to uncover the underlying elements of the problem. These decision-making models and theories can offer extensive guidance on decision processes aiming to integrate BDA. Further, by adopting methodologies from decision science and social science, BDA research can maintain co-evolutionary progress (Miah et al., 2017). After framing the problem properly, solutions need to be determined using models, data collection and analysis, applying appropriate techniques.

The inherent challenges of big data are a key constraint on the integration of BDA. Some key features of big data that are important for gaining benefits also present challenges, including inconsistency, heterogeneity, incompleteness, lack of trustworthiness, abundance and rapidity (Gandomi and Haider, 2015; Chen et al., 2013; Jagadish et al., 2014). Therefore, scholars and practitioners pay increasing attention to overcoming poor data quality (Hazen et al., 2014) and data binge, or overload of data (Bumblauskas et al., 2017). The voluminousness, variety and velocity of big data, coupled with their data sources and a deficit of knowledge regarding the massive effort required in data collection, have raised concerns about data acquisition, warehousing and processing (Paris et al., 2014).
Further, challenges due to unstructured, diverse, vibrant and unreliable characteristics of big data cause serious constraints in mining, cleansing, integrating and analyzing the
data (Chen et al., 2013; Karacapilidis et al., 2013). Organizations need human resources or a pool of talent to carry out this process (Dubey and Gunasekaran, 2015). Finally, the BDA process may necessitate collaboration with diverse stakeholders having different skill-sets (Janssen and Kuk, 2016). Therefore, along with the technological resources, organizations need to give continued serious attention to the human aspect of effective BDA integration in the decision-making process.
CONCLUSION

Expediting the effective adoption and integration of BDA within the decision-making processes of present business organizations requires understanding and executing the BDA research method in a systematic manner. BDA analysts and researchers should pay the necessary attention to following the critical steps discussed in this chapter to ensure that the intended outcomes are achieved. Furthermore, organizations need to prepare their talent capabilities to reflect the renewed expectations of organizational human resources in the present big data era. Sound diffusion of BDA capabilities, and promoting managers possessing extensive and specialized business domain expertise, will be beneficial for organizational decision making supported by the BDA research method.
REFERENCES

Acharya, A., Singh, S.K., Pereira, V. and Singh, P. (2018). Big data, knowledge co-creation and decision making in fashion industry. International Journal of Information Management, 42, 90–101.
Akter, S. and Wamba, S.F. (2016). Big data analytics in e-commerce: A systematic review and agenda for future research. Electronic Markets, 26, 173–94.
Akter, S., Bandara, R., Hani, U., Wamba, S.F., Foropon, C. and Papadopoulos, T. (2019). Analytics-based decision-making for service systems: A qualitative study and agenda for future research. International Journal of Information Management, 48, 85–95.
Al Nuaimi, E., Al Neyadi, H., Mohamed, N. and Al-Jaroodi, J. (2015). Applications of big data to smart cities. Journal of Internet Services and Applications, 6, 25.
Bumblauskas, D., Nold, H., Bumblauskas, P. and Igou, A. (2017). Big data analytics: Transforming data to action. Business Process Management Journal, 23, 703–20.
Chauhan, A.S., Cuzzocrea, A., Fan, L., Harvey, J.D., Leung, C.K., Pazdor, A.G. and Wang, T. (2022). Predictive big data analytics for service requests: A framework. Procedia Computer Science, 198, 102–11.
Chen, C.P. and Zhang, C.-Y. (2014). Data-intensive applications, challenges, techniques and technologies: A survey on big data. Information Sciences, 275, 314–47.
Chen, J., Chen, Y., Du, X., Li, C., Lu, J., Zhao, S. and Zhou, X. (2013). Big data challenge: A data management perspective. Frontiers of Computer Science, 7, 157–64.
Chiang, L.L. and Yang, C. (2018). Does country-of-origin brand personality generate retail customer lifetime value? A Big Data analytics approach. Technological Forecasting and Social Change, 130, 177–87.
Côrte-Real, N., Oliveira, T. and Ruivo, P. (2017). Assessing business value of big data analytics in European firms. Journal of Business Research, 70, 379–90.
Davenport, T.H. (2013). Keep up with your quants. Harvard Business Review, July–August, 120–23.
Davenport, T.H. and Harris, J.G. (2017). Competing on Analytics: The New Science of Winning. Boston, MA: Harvard Business School Press.
Davenport, T.H. and Kim, J. (2013). Keeping up with the Quants: Your Guide to Understanding and Using Analytics. Boston, MA: Harvard Business School Press.
Davenport, T.H. and Kudyba, S. (2016). Designing and developing analytics-based data products. MIT Sloan Management Review, 58, 83–9.
De Vasconcelos, J.B. and Rocha, Á. (2019). Business analytics and big data. International Journal of Information Management, 46, 320–21.
Delen, D. and Demirkan, H. (2013). Data, information and analytics as services. Decision Support Systems, 55, 359–63.
Dubey, R. and Gunasekaran, A. (2015). Education and training for successful career in big data and business analytics. Industrial and Commercial Training, 47, 174–81.
Dwivedi, Y.K., Janssen, M., Slade, E.L., Rana, N.P., Weerakkody, V., Millard, J., Hidders, J. et al. (2017). Driving innovation through big open linked data (BOLD): Exploring antecedents using interpretive structural modelling. Information Systems Frontiers, 19, 197–212.
Elgendy, N. and Elragal, A. (2016). Big data analytics in support of the decision making process. Procedia Computer Science, 100, 1071–84.
Frizzo-Barker, J., Chow-White, P.A., Mozafari, M. and Ha, D. (2016). An empirical study of the rise of big data in business scholarship. International Journal of Information Management, 36, 403–13.
Gandomi, A. and Haider, M. (2015). Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information Management, 35, 137–44.
Ghasemaghaei, M., Hassanein, K. and Turel, O. (2017). Increasing firm agility through the use of data analytics: The role of fit. Decision Support Systems, 101, 95–105.
Grover, P., Kar, A.K., Dwivedi, Y.K. and Janssen, M. (2018). Polarization and acculturation in US election 2016 outcomes – Can twitter analytics predict changes in voting preferences. Technological Forecasting and Social Change. Accessed at https://doi.org/10.1016/j.techfore.2018.09.009.
Guo, L., Sharma, R., Yin, L., Lu, R. and Rong, K. (2017). Automated competitor analysis using big data analytics: Evidence from the fitness mobile app business. Business Process Management Journal, 23, 735–62.
Gupta, S., Kar, A.K., Baabdullah, A. and Al-Khowaiter, W.A.A. (2018). Big data with cognitive computing: A review for the future. International Journal of Information Management, 42, 78–89.
Hazen, B.T., Boone, C.A., Ezell, J.D. and Jones-Farmer, L.A. (2014). Data quality for data science, predictive analytics, and big data in supply chain management: An introduction to the problem and suggestions for research and applications. International Journal of Production Economics, 154, 72–80.
He, W., Wang, F.-K. and Akula, V. (2017). Managing extracted knowledge from big social media data for business decision making. Journal of Knowledge Management, 21, 275–94.
Jagadish, H., Gehrke, J., Labrinidis, A., Papakonstantinou, Y., Patel, J.M., Ramakrishnan, R. and Shahabi, C. (2014). Big data and its technical challenges. Communications of the ACM, 57, 86–94.
Janssen, M. and Kuk, G. (2016). Big and open linked data (BOLD) in research, policy, and practice. Journal of Organizational Computing and Electronic Commerce, 26, 3–13.
Janssen, M., van der Voort, H. and Wahyudi, A. (2017). Factors influencing big data decision-making quality. Journal of Business Research, 70, 338–45.
Joseph, R.C. and Johnson, N.A. (2013). Big data and transformational government. IT Professional, 15, 43–8.
Kaisler, S., Armour, F., Espinosa, J.A. and Money, W. (2013). Big data: Issues and challenges moving forward. Proceedings of the 46th Hawaii International Conference on System Sciences (HICSS), pp. 995–1004.
Karacapilidis, N., Tzagarakis, M. and Christodoulou, S. (2013). On a meaningful exploitation of machine and human reasoning to tackle data-intensive decision making. Intelligent Decision Technologies, 7, 225–36.
Kokina, J., Pachamanova, D. and Corbett, A. (2017). The role of data visualization and analytics in performance management: Guiding entrepreneurial growth decisions. Journal of Accounting Education, 38, 50–62.
Lee, C.K.H. (2017). A GA-based optimisation model for big data analytics supporting anticipatory shipping in retail 4.0. International Journal of Production Research, 55, 593–605.
Lilien, G.L. and Rangaswamy, A. (2006). Marketing Engineering: Computer-assisted Marketing Analysis and Planning. CreateSpace Independent Publishing Platform.
Loukis, E., Pazalos, K. and Salagara, A. (2012). Transforming e-services evaluation data into business analytics using value models. Electronic Commerce Research and Applications, 11, 129–41.
Maglio, P.P. and Lim, C.-H. (2016). Innovation and Big Data in smart service systems. Journal of Innovation Management, 4, 11–21.
Medina-Borja, A. (2015). Editorial column—Smart things as service providers: A call for convergence of disciplines to build a research agenda for the service systems of the future. Service Science, 7, ii–v.
Miah, S.J., Vu, H.Q., Gammack, J. and McGrath, M. (2017). A big data analytics method for tourist behaviour analysis. Information & Management, 54, 771–85.
Naik, P., Wedel, M., Bacon, L., Bodapati, A., Bradlow, E., Kamakura, W., Kreulen, J. et al. (2008). Challenges and opportunities in high-dimensional choice data analyses. Marketing Letters, 19, 201–13.
Opresnik, D. and Taisch, M. (2015). The value of big data in servitization. International Journal of Production Economics, 165, 174–84.
Paris, J., Donnal, J.S. and Leeb, S.B. (2014). NilmDB: The non-intrusive load monitor database. IEEE Transactions on Smart Grid, 5, 2459–67.
Peffers, K., Tuunanen, T., Rothenberger, M.A. and Chatterjee, S. (2007). A design science research methodology for information systems research. Journal of Management Information Systems, 24, 45–77.
Phillips-Wren, G. and Hoskisson, A. (2015). An analytical journey towards big data. Journal of Decision Systems, 24, 87–102.
Phillips-Wren, G., Iyer, L., Kulkarni, U. and Ariyachandra, T. (2015). Business analytics in the context of big data: A roadmap for research. Communications of the Association for Information Systems, 37, 448–72.
Popovič, A., Hackney, R., Tassabehji, R. and Castelli, M. (2018). The impact of big data analytics on firms' high value business performance. Information Systems Frontiers, 20, 209–22.
Power, D.J. (2016). Data science: Supporting decision-making. Journal of Decision Systems, 25, 345–56.
Ragini, J.R., Anand, P.M.R. and Bhaskar, V. (2018). Big data analytics for disaster response and recovery through sentiment analysis. International Journal of Information Management, 42, 13–24.
Ransbotham, S., Kiron, D. and Prentice, P.K. (2016). Beyond the hype: The hard work behind analytics success. MIT Sloan Management Review, 57.
Saboo, A.R., Kumar, V. and Park, I. (2016). Using big data to model time-varying effects for marketing resource (re)allocation. MIS Quarterly, 40, 911–39.
Salehan, M. and Kim, D.J. (2016). Predicting the performance of online consumer reviews: A sentiment mining approach to big data analytics. Decision Support Systems, 81, 30–40.
Sharma, R., Mithas, S. and Kankanhalli, A. (2014). Transforming decision-making processes: A research agenda for understanding the impact of business analytics on organisations. European Journal of Information Systems, 23, 433–41.
Shirazi, F. and Mohammadi, M. (2019). A big data analytics model for customer churn prediction in the retiree segment. International Journal of Information Management, 48, 238–53.
Simon, H.A. (1977). The New Science of Management Decision (3rd edn). Englewood Cliffs, NJ: Prentice-Hall.
Singh, J.P., Dwivedi, Y.K., Rana, N.P., Kumar, A. and Kapoor, K.K. (2017). Event classification and location prediction from tweets during disaster. Annals of Operations Research. Accessed at https://doi.org/10.1007/s10479-017-2522-3.
Sivarajah, U., Kamal, M.M., Irani, Z. and Weerakkody, V. (2017). Critical analysis of big data challenges and analytical methods. Journal of Business Research, 70, 263–86.
Sun, S., Cegielski, C.G. and Li, Z. (2015). Research note for amassing and analyzing customer data in the age of big data: A case study of Haier's online-to-offline (O2O) business model. Journal of Information Technology Case and Application Research, 17, 166–71.
Sutanto, J., Kankanhalli, A., Tay, J., Raman, K.S. and Tan, B.C. (2008). Change management in interorganizational systems for the public. Journal of Management Information Systems, 25(3), 133–76.
Tan, F.T.C., Guo, Z., Cahalane, M. and Cheng, D. (2016). Developing business analytic capabilities for combating e-commerce identity fraud: A study of Trustev's digital verification solution. Information & Management, 53, 878–91.
Waller, M.A. and Fawcett, S.E. (2013). Data science, predictive analytics, and big data: A revolution that will transform supply chain design and management. Journal of Business Logistics, 34, 77–84.
Wang, G., Gunasekaran, A., Ngai, E.W. and Papadopoulos, T. (2016). Big data analytics in logistics and supply chain management: Certain investigations for research and applications. International Journal of Production Economics, 176, 98–110.
Watson, H.J. (2014). Tutorial: Big data analytics: Concepts, technologies, and applications. Communications of the Association for Information Systems, 34, 1247–68.
Wedel, M. and Kannan, P.K. (2016). Marketing analytics for data-rich environments. Journal of Marketing, 80, 97–121.
Xiang, Z., Schwartz, Z., Gerdes, J.H. and Uysal, M. (2015). What can big data and text analytics tell us about hotel guest experience and satisfaction? International Journal of Hospitality Management, 44, 120–30.
Zhong, R.Y., Newman, S.T., Huang, G.Q. and Lan, S. (2016). Big data for supply chain management in the service and manufacturing sectors: Challenges, opportunities, and future perspectives. Computers & Industrial Engineering, 101, 572–91.
Zicari, R.V. (2014). Big data: Challenges and opportunities. In R. Ekerkar (ed.), Big Data Computing (pp. 103–28). Boca Raton, FL: CRC Press, Taylor & Francis Group.
2. Big data research methods in financial prediction
Md Lutfur Rahman and Shah Miah
1. INTRODUCTION

Humans living in the digital age now produce more data every two days than mankind produced from the advent of time to the year 2003 (Goldstein et al., 2021). This "big data" is substantially transforming the financial services industry. Consequently, it affects how we teach finance in classrooms and can significantly redesign finance research in the coming years. While big data is conventionally characterized by large volume, velocity, and variety (Miah et al., 2021), its challenges and opportunities in finance research may not be well reflected by these characteristics.1 Therefore, although academic research in finance has started using big data and associated research methods (Li et al., 2021a; Anand et al., 2021; Aziz et al., 2022), this line of research raises several questions, such as: Is the meaning of big data in finance different? Can financial economists benefit from big data research methods compared to traditional financial modelling? Can big data be used to answer novel research questions in a novel way?

Goldstein et al. (2021) posit that three properties potentially define big data in finance research: large size, high dimension, and complex structure. Finance research is often criticized for sample selection bias and omitted variable bias. Using a larger dataset, or big data, may help to mitigate these biases. Financial prediction research, in particular, is also challenging because (1) economic problems in reality involve many variables; (2) the variables' impact is nonlinear and involves interactions with other variables; and (3) predictions need to be economically meaningful. While these challenges arise from high dimensionality, big data research methods (that is, machine learning (ML) and artificial intelligence (AI)) may address them. Further, instead of the conventional row–column data format, complex data structures (for example, text, pictures, videos, audio) may add value if they can be used to explain economic behaviour and activities. Deep learning and natural language processing may be used in these cases. Overall, the availability of larger datasets and the development and use of methodologies capable of handling high dimensionality and complex structure may bring novelty to finance research, in particular to financial forecasting. Given this background, the main purpose of this chapter is to provide a systematic review of big data research methods used in finance (for example, ML, natural language processing) with a particular focus on financial predictability, identify the superiority of these approaches in comparison to traditional predictive methods, and provide directions for future research using big data research approaches in financial forecasting.
1. This is primarily because finance research has commonly relied upon large datasets. For instance, Rösch (2021) explores market liquidity by analysing several billion trades. Jacobs and Müller's (2020) analysis is based on more than 2 million anomaly country-months.
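None of the studies cited above is reproduced here; the sketch below only illustrates, on simulated data, how a penalized regression such as the LASSO can handle a predictor set larger than the number of observations, which is one of the dimensionality problems noted above.

```python
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n_months, n_predictors = 240, 400   # more candidate predictors than observations

# Simulated predictor panel; only the first five predictors truly matter.
X = rng.standard_normal((n_months, n_predictors))
true_beta = np.zeros(n_predictors)
true_beta[:5] = [0.4, -0.3, 0.25, 0.2, -0.15]
returns = X @ true_beta + 0.5 * rng.standard_normal(n_months)

# Cross-validated LASSO shrinks irrelevant coefficients to exactly zero.
model = LassoCV(cv=5).fit(X, returns)
selected = np.flatnonzero(model.coef_)
print(f"{len(selected)} of {n_predictors} predictors retained, e.g. {selected[:10]}")
```

An ordinary least squares regression cannot even be estimated in this setting, which is why regularized and ML-based methods feature so prominently in the predictability literature reviewed below.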
Although big data is a relatively recent phenomenon, we can trace the use of big data research methods in finance back to the early 1990s. For instance, Hawley et al. (1990) use artificial neural systems as a tool for financial decision-making. Altman et al. (1994) apply neural networks, and Varetto (1998) uses genetic algorithms for credit risk measurement and improved lending decisions. More recent relevant research in finance focuses on financial prediction using ML and deep learning (DL). For example, Kamiya et al. (2019) use AI and Khandani et al. (2010) and Sigrist and Hirnschall (2019), among others, use ML for default risk prediction. Wang et al. (2017) and Sirignano (2019) apply ML for predicting stock returns. Erel et al. (2021) are among the first to use ML in corporate finance and show that ML can choose better directors than humans, potentially because machines are less exposed to biases and conflicts. The energy finance literature also uses ML and DL for predicting energy price and volatility (Li et al., 2021b; Zhao et al., 2017a), energy consumption (Beyca et al., 2019), and carbon footprints (Nguyen et al., 2021), among others. The above examples demonstrate distinct big data research approaches used in mainstream finance research. However, the literature still lacks a complete review of the approaches used in the financial prediction literature, their superiority over conventional predictability models, and their potential use in novel research areas. This chapter contributes by filling this gap. Some related review studies have provided an excellent discussion on big data research. For instance, Heaton et al. (2017) explore the use of DL in financial prediction and classification. Athey (2019) discusses the impact of ML in economics research. Mullainathan and Spiess (2017) explain ML as an econometric tool. However, our focus in this chapter is to review the financial predictability literature that uses big data research methods. We start by providing: (1) a discussion on the definition of big data in finance; (2) big data research approaches (ML and DL) used in financial prediction (asset return, loan default, and energy price); (3) future directions of research. Since big data research in financial prediction is a multidisciplinary phenomenon, the literature is vast and diverse. However, to keep the review relevant and rigorous, the study purposively concentrates on sample articles that are quality examples in the relevant fields as well as published previously in the top-tier finance, economics, econometrics, and forecasting journals.
2. BIG DATA IN FINANCE
Since the introduction of the term "big data", the literature has yet to settle on a universally accepted definition. Conventionally, big data are characterized by "large volume, velocity, rapid emergence, and varied formats" (Miah et al., 2021, p. 2). Time variation, diverse sources, and volatility are also typical characteristics of big data. "Big data" is often defined as huge datasets that cannot be stored, managed, and analysed by typical software tools, and big data is also referred to as having 7Vs: value, volume, velocity, variety, validity, volatility and veracity. However, the definition of big data in finance differs from that used in engineering and statistics (Goldstein et al., 2021), because those disciplines concentrate on capturing, curating, managing and processing data, whereas researchers in finance attempt to use big data to address novel economic questions. As mentioned previously, Goldstein et al. (2021) define big data in finance research by three properties: large size, high dimension, and complex structure.
2.1 Size
As the term "big data" suggests, big data are large in both an absolute and a relative sense. In an absolute sense, the amount of data (in the form of files, records, tables, etc.) is large, often representing millions or billions of data points; examples include transaction-level equity and bond data, credit card purchases, and product reviews. In a relative sense, big data are always larger than any form of "small data". We typically use small data simply because they are a subset of the big dataset. This practice leads to sample selection bias, and often data snooping bias, if the economic characteristics of the large data are not correctly depicted in the small dataset. The use of big data in finance research can mitigate these biases.
2.2 Dimension
Big data in finance research typically hold a large number of variables in relation to the sample size, and big data handling methods (for example, ML) attempt to extract information from the complex interaction of these variables. ML or AI-based techniques can address novel research questions if a research problem involves a large number of variables. Conventional approaches (such as ordinary least squares (OLS) regressions) typically exhibit poor performance if the predictive model involves many predictors. Furthermore, conventional models also generate inferior forecasts in the case of complex nonlinear interactions among variables. ML techniques are particularly designed to address these issues.
2.3 Structure
Big data in finance can have a complex structure. While we are familiar with structured or two-dimensional data organized in tables (in row and column format), big data can be unstructured, generated by social media, emails, text messages, data-sensing devices, and video and audio recordings. Although unstructured data can be valuable to finance researchers if they can measure economic activities, we often need specialized applications or custom programs to convert the complex datasets into meaningful insights for decision-makers (for example, investors). As unstructured data are typically high-dimensional and complex in structure, ML and DL techniques are used to extract features from these data. Finance researchers often use natural language processing to derive information (for example, investor sentiment) from huge textual, voice, and audio data files.
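To make the dimensionality point in section 2.2 concrete, the short sketch below (not drawn from any of the reviewed studies) simulates a setting with many correlated predictors and compares a plain OLS fit with a cross-validated LASSO. The data-generating process, coefficients and scikit-learn implementation are illustrative assumptions only; the sketch shows the mechanics of regularization rather than reproducing any result cited in this chapter.

```python
# Minimal sketch (assumed setup): 200 correlated predictors, only a few of which
# matter, illustrating why OLS struggles while a penalized model does not.
import numpy as np
from sklearn.linear_model import LinearRegression, LassoCV
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_obs, n_pred = 300, 200                      # few observations relative to predictors

# Correlated predictors: a common factor plus idiosyncratic noise
factor = rng.normal(size=(n_obs, 1))
X = 0.7 * factor + 0.3 * rng.normal(size=(n_obs, n_pred))

# Only five predictors truly drive the simulated outcome
beta = np.zeros(n_pred)
beta[:5] = [0.5, -0.4, 0.3, 0.2, -0.2]
y = X @ beta + rng.normal(scale=0.5, size=n_obs)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

ols = LinearRegression().fit(X_tr, y_tr)
lasso = LassoCV(cv=5).fit(X_tr, y_tr)         # penalty chosen by cross-validation

print("Out-of-sample R^2, OLS:  ", round(ols.score(X_te, y_te), 3))
print("Out-of-sample R^2, LASSO:", round(lasso.score(X_te, y_te), 3))
print("Predictors retained by LASSO:", int(np.sum(lasso.coef_ != 0)))
```

In simulations of this kind the unregularized fit tends to overfit the training sample while the penalized model discards most of the irrelevant predictors; the exact numbers will vary with the random seed.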
3. BIG DATA RESEARCH METHODS AND PREDICTION OF ASSET RETURNS
Big data research methods (for example, ML) are increasingly used in predicting asset returns. Gu et al. (2020, p. 2225) define ML used in return prediction as: “(a) a diverse collection of high-dimensional models for statistical prediction, combined with (b) so-called ‘regularization’ methods for model selection and mitigation of overfit, and (c) efficient algorithms for searching among a vast number of potential model specifications.” The empirical asset pricing literature has two major research agendas: (1) asymmetric expected return across assets, (2) dynamics of market risk premium. The second agenda is
a fundamental problem of prediction/forecasting where ML approaches play an important role. Thus far, the literature has accumulated a long list of predictor variables that are argued to have predictive power for asset returns. For instance, Green et al. (2013) and Harvey et al. (2016) count over 300 stock-level, firm-level and common predictors describing asset return behaviour. These predictors are also commonly highly correlated. Traditional predictive models (for example, ordinary least squares (OLS) regression, partial least squares (PLS) model, multiple predictive regression, and so on) perform poorly when the number of predictor variables increases and they are highly correlated. Another complicating factor is the functional form (for example, linear, nonlinear, possible interactions among predictors) of the predictors in a predictive model. The conventional asset pricing literature provides little guidance to solve these problems (Gu et al., 2020). Therefore, finance researchers are using ML-based models to address these issues (see, for example, Mascio et al., 2021; Leippold et al., 2021; Obaid and Pukthuanthong, 2021; Azimi and Agrawal, 2021; Buehlmaier and Zechner, 2021; Bianchi et al., 2021). Commonly used ML approaches deal with the challenges of the empirical asset pricing model mentioned in the previous paragraph. For instance, ML algorithms are specialized in prediction tasks; therefore, they are suitable for predicting risk premium. With regard to the long list of predictor variables, ML-based variable selection and dimension reduction techniques reduce degrees of freedom and condense redundant variation among predictors. Finally, the complexity related to the functional form of predictors in the predictive model is addressed by ML in three ways. First, ML approaches are diverse and based on a range of dissimilar methods. Second, ML approaches (for example, regression trees, neural networks) are able to incorporate complex nonlinear associations. Third, various functional forms of ML methods reduce biases associated with model overfit and false discovery. Due to the advantages mentioned above, a large collection of literature has already used various ML approaches for asset return prediction. A summary of the key papers is presented in Table 2.1. Gu et al. (2020) use a range of ML techniques for predicting risk premium/excess returns: boosted regression trees, random forests, shallow and deep neural networks. The authors report that ML-based forecasts generate large economic gains compared to conventional regression-based strategies used in the literature, such as linear regression, generalized linear models, principal component regressions, and partial least squares. While Gu et al. (2020) use ML to predict equity risk premium, similar approaches (extreme trees and neural networks) are used by Bianchi et al. (2021) for bond return predictability. Supporting Gu et al. (2020), the authors show that neural network-based forecasts lead to larger economic gain compared to conventional forecasts. These studies, in general, conclude that ML algorithms are particularly suitable for the large collections of predictor variables. They also result in greater predictive accuracy by incorporating nonlinearities in the predictive relationship. Several other ML research methods have also been used in the financial return prediction literature. For instance, Han et al.
(2020) use the least absolute shrinkage and selection operator (LASSO)-based model. They utilize a particular version of LASSO termed “encompassing LASSO (E-LASSO)”, which combines the LASSO, forecast combination, and forecast encompassing. The main advantage of this approach is that it implements variable shrinkage flexibly, which is particularly suitable when the predictive model includes many potential predictors. The authors forecast cross-sectional stock returns and report that their approach leads to higher economic gain and lower forecast error than conventional approaches, such as OLS and weighted least squares regression models. Several other researchers use LASSO-type
ML models to predict asset returns, and they predominantly report superior predictive performance of these approaches compared to relevant benchmark models. For example, Mascio et al. (2021) report that a LASSO-based investment strategy consistently generates higher annual returns and lower monthly drawdowns. While most previous studies focus on the US market, Leippold et al. (2021) apply ML in the Chinese stock market and show that predictive performance is statistically significant even after considering transaction costs. Iworiso and Vrontos (2020) show that the LASSO-based directional predictability model for the equity premium outperforms the binary probit benchmark model both statistically and economically. Although most studies report the superiority of ML-based models in predicting asset returns, a few studies find that ML algorithms underperform certain benchmark models. To predict cross-sectional stock returns, Huang et al. (2021) use six LASSO-related ML methods (equal weight, combination, encompassing, adaptive, egalitarian, elastic net). Interestingly, although they find that ML techniques generate statistically significant in- and out-of-sample predictability, they typically underperform PLS. In line with Kelly and Pruitt (2013), the authors argue that PLS generates asymptotically consistent forecasts and minimum forecast error if the consistency condition is met. While a relatively large collection of literature reports that investor sentiment can effectively predict stock returns (Baker and Wurgler, 2006; Kozak et al., 2018), the extant measures of investor sentiment are often criticized for having low accuracy, low power, and incorrect inferences (Azimi and Agrawal, 2021). Therefore, researchers use ML approaches to develop investor sentiment measures. For instance, Obaid and Pukthuanthong (2021) use ML (convolutional neural network) to classify photos based on investor sentiment and propose a new investor sentiment index (Photo Pessimism). The authors argue that their approach is more accurate, verifiable and cost-effective compared to manual sifting of photos or subjective survey data. Further, the paper shows that since Photo Pessimism captures non-financial impact, it predicts market return reversal and trading volume better than conventional predictive models. In a relevant study, Azimi and Agrawal (2021) use an ML-based text classification approach (recurrent neural networks) to measure investor sentiment. The approach is argued to be more accurate, intuitive and easily interpretable compared to other commonly used approaches (such as word dictionaries and the naïve Bayesian classification method). The authors show that qualitative information contained in annual reports is important in predicting stock returns and that both positive and negative sentiment have information content for return prediction when sentiment is measured correctly. ML-based textual analysis is also used by Buehlmaier and Zechner (2021) for predicting stock returns. The authors show that investors underreact to information in the financial media, which results in an increase in the subsequent 12-day return of a long–short merger strategy. ML in return prediction has also been used by Adämmer and Schüssler (2020) and Huang and Zhang (2022), among others. While the papers reviewed in this section predominantly focus on predicting bond and equity prices, researchers have also used ML to forecast other asset prices.
For instance, Milunovich (2019) concentrates on Australia’s house price index, while Plakandaras et al. (2015) shed light on exchange rate predictions. Specifically, Milunovich (2019) utilizes 47 different ML and DL algorithms and traditional time series models to forecast house prices and growth rates. The author finds a mixed result in terms of the forecast accuracy of the traditional and ML-based models. Several algorithms outperform the benchmark random walk model in the case of one- and two-quarter forecasts. In some instances, conventional linear autoregressive moving average and vector autoregressive models generate forecasts as accurate as those
generated by ML and DL approaches. The linear support vector regressor (SVR) is found to be the most successful predictive model. Plakandaras et al. (2015) use support vector machines and neural networks to forecast spot prices of five exchange rates. The authors use an ensemble empirical mode decomposition to decompose original exchange rates, apply multivariate adaptive regression splines to select the most appropriate predictors, and apply ML-based predictive models. The paper reports that the implemented approach results in a superior forecast (both in- and out-of-sample) in relation to alternative predictive models.
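As a rough illustration of the text-based sentiment measures discussed above, the sketch below trains a simple bag-of-words classifier on a handful of invented headlines. It is deliberately much simpler than the recurrent and convolutional networks used by Azimi and Agrawal (2021) and Obaid and Pukthuanthong (2021); the sentences, labels and pipeline are hypothetical and serve only to show how textual data can be converted into a sentiment signal.

```python
# Illustrative only: a bag-of-words sentiment classifier as a lightweight stand-in
# for the neural text models cited above (the headlines and labels are invented).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

headlines = [
    "earnings beat expectations and guidance raised",
    "profit warning issued after weak quarter",
    "record revenue growth across all segments",
    "regulator opens probe into accounting practices",
    "dividend increased as cash flow strengthens",
    "credit rating downgraded on rising leverage",
]
labels = [1, 0, 1, 0, 1, 0]   # 1 = positive tone, 0 = negative tone (toy labels)

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(headlines, labels)

new_text = ["guidance cut amid falling margins"]
print("Predicted tone:", model.predict(new_text)[0])
print("P(positive):", round(model.predict_proba(new_text)[0, 1], 2))
```

In applied work the fitted probability (or its cross-sectional average) would then enter a return-prediction model as a sentiment regressor.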
Table 2.1  Big data research methods and prediction of asset returns

Citation | Big data research approach | Key advantages | Conventional model/approach
Mascio et al. (2021) | Least absolute shrinkage and selection operator (LASSO) (ML) | Superior return forecast; superior prediction of a market downturn | Kitchen sink logistic regression
Leippold et al. (2021) | LASSO; elastic net (Enet); gradient boosted regression trees (GBRT); random forest (RF); neural networks | Greater predictive accuracy | Partial least squares (PLS); OLS model
Obaid and Pukthuanthong (2021) | Convolutional neural networks (CNNs) | Accurate, verifiable and cost-effective; capture non-financial impact | Manual sifting of photos; subjective survey data
Azimi and Agrawal (2021) | Recurrent neural networks | Accurate, intuitive and easily interpretable; capture complex nonlinear dependencies | Word dictionaries; naïve Bayesian classification method
Buehlmaier and Zechner (2021) | Random forests | Capture more information | Dictionary approaches: weak modal words, tonal words
Bianchi et al. (2021) | Boosted regression trees; random forests; extremely randomized regression trees; shallow and deep neural networks | Suitable for large collections of predictor variables; allow nonlinearities; larger economic gain | Principal component analysis; PLS; penalized linear regressions
Gu et al. (2020) | Boosted regression trees; random forests; shallow and deep neural networks | Suitable for large collections of predictor variables; allow nonlinearities; larger economic gain | Ordinary least squares (OLS) regressions
Han et al. (2020) | Encompassing LASSO | Flexible shrinkage procedure; reduce forecast error; significant economic gain | OLS and weighted least squares
Milunovich (2019) | Support vector machine (SVM); ridge; LASSO; elastic net | Greater predictive accuracy; capture nonlinear relationships; cross-validation for model selection | Random walk with drift model
Rapach and Zhou (2020) | Combination elastic net (C-Enet); forecast combination | Select the most relevant univariate forecast; prevent overfitting; improve accuracy; provide substantive economic value to an investor | Multiple predictive regression
Sirignano (2019) | Spatial neural network (ML) | More effective use of information from deep in the order book | Logistic regression
Iworiso and Vrontos (2020) | Classification and regression tree; Bayesian classifiers; neural networks; LASSO | Higher predictive accuracy; low misclassification error | Binary probit model
Ntakaris et al. (2018) | Ridge regression; single hidden-layer feedforward neural network-based nonlinear regression | Superior prediction of mid-price movement | Logistic regression
Renault (2017) | Natural language processing (ML) | More transparent and replicable approach | Dictionary-based classification
Manela and Moreira (2017) | Support vector regression | Ability to deal with a large feature space | OLS
Plakandaras et al. (2015) | Ensemble empirical mode decomposition; multivariate adaptive regression splines; support vector machines; neural networks | Superior forecast; better capture nonlinearities | Random walk model; monetary exchange rate model; autoregressive integrated moving average
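The studies summarized in Table 2.1 share a common workflow: assemble a panel of lagged predictors, fit a flexible learner on a training window, and judge the forecasts by out-of-sample R² against a naive benchmark. The sketch below imitates that workflow on simulated data with a gradient-boosting learner; it is a stylized illustration under invented assumptions, not a reproduction of Gu et al. (2020) or any other study in the table.

```python
# Compressed, assumed-data sketch of an ML return-prediction workflow:
# fit a flexible learner on lagged "characteristics" and report out-of-sample R^2
# measured against a zero forecast, as is common in this literature.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
n, p = 5000, 20                                    # stock-month observations, characteristics
X = rng.normal(size=(n, p))

# Simulated "true" risk premium: nonlinear in one characteristic plus an interaction
signal = 0.02 * np.tanh(X[:, 0]) + 0.01 * X[:, 1] * X[:, 2]
returns = signal + rng.normal(scale=0.10, size=n)  # returns are mostly noise

split = int(0.7 * n)                               # pseudo "training" and "test" periods
gbr = GradientBoostingRegressor(n_estimators=300, max_depth=2, learning_rate=0.05)
gbr.fit(X[:split], returns[:split])

pred = gbr.predict(X[split:])
oos_r2 = 1 - np.sum((returns[split:] - pred) ** 2) / np.sum(returns[split:] ** 2)
print("Out-of-sample R^2 (vs. zero forecast):", round(oos_r2, 4))
```

Even small positive out-of-sample R² values can translate into meaningful economic gains once forecasts are turned into portfolio weights, which is why the papers above emphasize both statistical and economic evaluation.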
4. BIG DATA RESEARCH METHODS AND PREDICTION OF DEFAULT RISK
A big part of the financial predictability literature involves modelling/prediction of loan default, financial distress, and corporate failures. Bellovary et al. (2007) provide a good review of the default prediction literature. Corporate lending decisions typically rely on models that incorporate “hard” information (such as borrower characteristics reflected in credit files) to generate a numerical score reflecting the creditworthiness of borrowers. In recent years, it is also customary for lending institutions to incorporate private information about borrowers (for example, within-account and across-account customer data) in their risk models. Although these models generally produce reasonably accurate ordinal measures of borrower creditworthiness, they often change only slowly over time. Therefore, these models are unable to capture changes in market conditions (Khandani et al., 2010). Additionally, default prediction models are also often criticized for incorporating only financial information, while qualitative information contained in conference calls, audit reports, and annual reports may make an important supplement to conventional default risk modelling (du Jardin, 2016; Tang et al.,
2020). Further, early studies on default risk prediction predominantly focus on linear classification models such as linear discriminant analysis or logistic regression (Altman, 1968; Ding et al., 2012; Bauer and Agarwal, 2014; Tian et al., 2015). These models only capture linear combinations of covariates, ignoring important information embedded in nonlinear interactions. Another complicating factor of predicting loan default or corporate failure is that the number of defaults is typically much lower than the number of non-defaults, which creates a class imbalance problem (Sigrist and Hirnschall, 2019). Due to the challenges associated with the conventional default prediction models explained in the previous paragraph, researchers are increasingly using various ML approaches. Empirical studies in this strand of the literature apply ML algorithms such as support vector machines, decision trees, and artificial neural networks, among others (see, for example, Kellner et al., 2022; Ari et al., 2021; Fuster et al., 2022; Liu et al., 2021; Sigrist and Hirnschall, 2019; Gogas et al., 2018; Butaru et al., 2016; Jones et al., 2015; Geng et al., 2015). These studies, in general, report that ML-based default risk prediction models provide several advantages, such as: (1) capturing interactions and nonlinearities; (2) solving noise problems; and (3) incorporating high dimensionality. These advantages typically lead to greater forecast accuracy and more benefits to investors than conventional models such as logistic regression, the Tobit model, linear quantile regression, discriminant analysis, and other conventional credit scoring models. Table 2.2 provides a summary of the relevant key papers. In a pioneering study, Khandani et al. (2010) use an ML-based (radial basis functions, tree-based classifiers, and support vector machines) cardinal measure of default risk that incorporates traditional credit factors and consumer banking transactions. The authors argue that their approach is able to capture subtle nonlinear relationships that are typically difficult to incorporate in traditional credit default models such as probit/logit regression, discriminant analysis, and other credit scoring models. The authors further report that their ML-based model generates more accurate forecasts than conventional models. Several other researchers have used ML-based default prediction approaches (see, for example, Jones et al., 2015; Lessmann et al., 2015; Yu et al., 2020a; 2020b; Carmona et al., 2019; Pham and Ho, 2021). Jones et al. (2015) focus on a large sample of international credit ratings and compare the predictive performance of conventional techniques (such as the logit model) and ML techniques (neural networks, support vector machines, and generalized boosting). Likewise, Lessmann et al. (2015) compare several classification algorithms (random forest, artificial neural network) for credit scoring. Yu et al. (2020a) utilize a novel dual-weighted fuzzy proximal support vector machine (SVM) for credit risk analysis. They use several algorithms such as an artificial neural network with a backpropagation algorithm, a standard support vector machine, a proximal SVM, and a dual-weighted fuzzy proximal SVM. In a closely related study, Yu et al. (2020b) concentrate on noise problems in credit risk analysis. The authors use classification and regression trees. Carmona et al. (2019) predict failure in the banking sector in the United States.
The authors use extreme gradient boosting (XGBoost) to forecast bank financial distress. In a similar study, Pham and Ho (2021) use boosting algorithms to predict bank failure. To address the class imbalance problem mentioned earlier, Sigrist and Hirnschall (2019) introduce a novel binary classification model (the Grabit model) that applies gradient tree boosting. Because this model incorporates auxiliary data, applying it to loans made to Swiss SMEs, the authors show a substantial enhancement in predictive performance compared to other competing approaches (linear logit and Tobit models). Kellner et al. (2022) use a neural network-based approach to predict loss given default (LGD). While prior studies used neural
networks for LGD estimations (Qi and Zhao, 2011; Loterman et al., 2012), in Kellner et al.'s (2022) approach the network is calibrated for a discrete set of quantiles of the LGD distribution. The authors argue that their approach provides a broader description of underlying relations, reduces the computational burden, and tests joint effects. Overall, the model exhibits superior performance over benchmark models in default prediction. ML-based LGD prediction has also been conducted by Bastos (2010), Bellotti and Crook (2012) and Yao et al. (2017), among others; these studies broadly support the results of Sigrist and Hirnschall (2019) and Kellner et al. (2022). Ari et al. (2021) use an ML-based model selection approach (post-r-lasso) to predict nonperforming loans (NPL). The authors show that post-r-lasso is particularly useful when there is a long list of predictor variables. The approach makes a trade-off between predictor variables and sample size, leading to superior forecast accuracy. Likewise, Tang et al. (2020) combine financial, management and textual factors in their predictive model and use several ML approaches such as random forest, gradient boosting decision tree, deep neural network, and recurrent neural network. The authors show that ML approaches better capture high dimensionality and generate higher predictive accuracy. In particular, the paper shows that ensemble classifiers and deep learning models are superior in financial distress prediction compared to other conventional models. Supporting these results, Fuster et al. (2022) show that ML technology (a random forest model) results in higher predictive accuracy for loan default than simpler logistic models. Although the papers reviewed earlier in this section focus on default prediction in the conventional lending market, online lending platforms (for example, peer-to-peer) are also popular nowadays. While an accurate prediction of borrowers' default risk is important in this emerging industry, risk prediction is challenging due to high information asymmetry (Chen et al., 2018). Therefore, ML algorithms are particularly useful. Accordingly, Liu et al. (2021) use several machine learning techniques (random forest, k-nearest neighbour algorithm, and support vector machines) on P2P platform data from China. The main purpose of their paper is to explore the difference between the predicted and actual value of defaults. The authors report that ML algorithms exhibit superior forecasting performance with regard to borrowers' default probability compared to traditional models (such as logistic regression). Several other studies show that ML-based techniques result in better prediction than conventional models for online lending platforms (see, for example, Rao et al., 2020; Chen et al., 2020). More specifically, Rao et al. (2020) use a two-stage syncretic cost-sensitive random forest (SCSRF) model to assess the credit risk of borrowers. The authors consider P2P loans in the agriculture, rural and farmers sectors, and validate the SCSRF model against commonly used credit evaluation models. Utilizing P2P lending data from China (the Renrendai online platform), Chen et al. (2020) show that incorporating borrowers' demographic characteristics (such as education, age, gender, and marital status) through ML helps better predict borrowers' probability of default.
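As a simplified stand-in for the quantile-based LGD modelling of Kellner et al. (2022) discussed above, the sketch below fits gradient-boosting learners with a quantile (pinball) loss to simulated loss-given-default outcomes. The covariates, the data-generating process and the choice of learner are illustrative assumptions; the point is only to show how a model can be calibrated to several quantiles of the LGD distribution rather than to its mean.

```python
# Assumed, simulated example: estimate the 10th, 50th and 90th percentiles of LGD
# with a quantile-loss boosting model (a simpler stand-in for a quantile neural net).
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(2)
n = 2000
collateral = rng.uniform(0, 1, n)          # hypothetical loan-level covariates
seniority = rng.integers(0, 2, n)
X = np.column_stack([collateral, seniority])

# Simulated loss given default: better collateral and seniority lower the loss
lgd = np.clip(0.8 - 0.5 * collateral - 0.2 * seniority + rng.normal(0, 0.15, n), 0, 1)

quantile_models = {
    q: GradientBoostingRegressor(loss="quantile", alpha=q, n_estimators=200).fit(X, lgd)
    for q in (0.1, 0.5, 0.9)
}

new_loan = np.array([[0.3, 0]])            # low collateral, junior claim
for q, m in quantile_models.items():
    print(f"Estimated LGD at the {int(q * 100)}th percentile: {m.predict(new_loan)[0]:.2f}")
```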
Table 2.2  Big data and prediction of default risk/business failures

Citation | Big data research approach | Key advantages | Conventional model
Kellner et al. (2022) | Neural network | Capture interactions and nonlinear impacts; precision in quantile forecasts | Linear quantile regression
Ari et al. (2021) | Post rigorous least absolute shrinkage and selection operator (ML) | Useful when the number of predictors is large compared to the sample size | Conventional model selection models
Fuster et al. (2022) | Random forest | Higher predictive accuracy; increased flexibility | Logistic regression
Tantri (2021) | Gradient boosted decision trees; XGBoost algorithm | Reduce the number of defaults | Conventional loan approval model
Liu et al. (2021) | Random forest; k-nearest neighbour algorithm; support vector machines (SVM) | More accurate prediction; more benefits to investors | Logistic regression
Yu et al. (2020a) | Artificial neural network with backpropagation algorithm; standard support vector machine; proximal SVM; dual-weighted fuzzy proximal SVM | Generalized ability and high practical value; superior discriminating power | Linear regression; logistic regression
Yu et al. (2020b) | Classification and regression tree (CART) | Solve noise problem; greater accuracy and stability | –
Sigrist and Hirnschall (2019) | Gradient tree boosting (ML) | Increased predictive accuracy for imbalanced data; capture nonlinearities and discontinuities; robust to outliers | Linear logit and Tobit model
Tang et al. (2020) | Random forest; gradient boosting decision tree; deep neural network; recurrent neural network | Better capture high dimensionality; higher predictive accuracy | Logistic regression
Gogas et al. (2018) | Support vector machine | Greater forecast accuracy | Ohlson bankruptcy forecasting model
Butaru et al. (2016) | Decision trees and random models (ML) | A more accurate measure of loss probabilities | Logistic regression
Jones et al. (2015) | Neural networks; support vector machines; generalized boosting (ML) | – | Logit/probit model; linear discriminant analysis
Khandani et al. (2010) | Radial basis functions; tree-based classifiers; support vector machines | Accurate forecasting; capture nonlinear relationships | Logit model; discriminant analysis; credit scoring models
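The class imbalance problem noted earlier in this section can be illustrated with a few lines of code. The sketch below simulates a portfolio in which defaults are rare, fits a class-weighted random forest, and reports the out-of-sample AUC, the kind of ranking metric typically used in this literature. The borrower features, parameter choices and data-generating process are invented for illustration and do not correspond to any study in Table 2.2.

```python
# Minimal, assumed-data sketch: rare defaults make accuracy misleading, so a
# class-weighted learner is evaluated with AUC on a held-out sample.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)
n = 20000
income = rng.normal(size=n)                  # hypothetical borrower features
utilisation = rng.uniform(0, 1, n)
prior_arrears = rng.integers(0, 2, n)

# True default probability is low on average (imbalanced classes)
logit = -4.0 - 0.8 * income + 2.0 * utilisation + 1.5 * prior_arrears
p_default = 1 / (1 + np.exp(-logit))
default = rng.binomial(1, p_default)

X = np.column_stack([income, utilisation, prior_arrears])
X_tr, X_te, y_tr, y_te = train_test_split(X, default, test_size=0.3, random_state=0)

clf = RandomForestClassifier(n_estimators=300, class_weight="balanced", random_state=0)
clf.fit(X_tr, y_tr)

print("Default rate in sample:", round(default.mean(), 3))
print("Out-of-sample AUC:", round(roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1]), 3))
```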
5. BIG DATA RESEARCH METHODS IN ENERGY FINANCE
Forecasting energy prices is one of the major strands of the financial forecasting literature. Although early studies use classic econometric approaches,2 they appear to generate inferior forecasts compared to more recently developed ML and DL-based predictive models (Li et al., 2021b; Zhao et al., 2017a; Beyca et al., 2019; Nguyen et al., 2021). Energy prices, oil prices in particular, usually exhibit greater fluctuation than other commodities. This is because energy prices are not determined only by the interaction of demand and supply; the international energy market is also influenced by a large number of exogenous factors such as weather, financial market prices, geopolitical aspects, economic growth, psychological expectations, and so on. Therefore, researchers need forecasting technology that can capture a large number of variables and complex interactions among them. This leads them to use ML methods, as they are particularly designed to capture complex characteristics (such as nonlinearity and volatility).3 Table 2.3 presents a summary of the key papers in this strand of the literature.
2  For example, recursive vector autoregressive models (Baumeister and Kilian, 2012), forecast combinations (Baumeister and Kilian, 2015), various GARCH models (Klein and Walther, 2016; Hassan et al., 2020), vector autoregressive (VAR) and vector error correction models (Cheng and Cao, 2019), univariate and multivariate linear predictive models and Granger causality-type models (Dai and Kang, 2021; Ederington et al., 2021), VAR-based frequency connectedness (Ferrer et al., 2018), the quantile autoregressive distributed lag model (Guo et al., 2021), and the generalized linear model (Ilyas et al., 2021), among others.
3  More recently, researchers increasingly rely on ML-based research methods, such as genetic algorithms (Kaboudan, 2001), neural networks (NN) (Moshiri and Foroutan, 2006), support vector machines (SVM) (Xie et al., 2006), semi-supervised learning (SSL) (Shin et al., 2013), and gene expression programming (GEP) (Mostafa and El-Masry, 2016), among others. While these approaches are mostly single models in their original form, researchers also use hybrid ML or NN models such as the adaptive network-based fuzzy inference system (Ghaffari and Zare, 2009). Hybrid models are also used by Yu et al. (2014) and Chiroma et al. (2015), among others. Despite the increasing use of ML approaches, they are often found to generate inferior forecasts compared to deep learning approaches (Mallqui and Fernandes, 2019). Therefore, artificial neural networks and deep learning models are increasingly being used in energy price and volatility forecasting (Zhao et al., 2017b).
A big part of the energy finance literature covers forecasting crude oil prices and volatilities as they have an important impact on economic activities globally. Oil price prediction also provides important guidelines to policymakers and investors. Studies in this strand of the literature attempt to capture substitution effects from other commodities (such as natural gas, coal and renewable energy), and complex relationships (for example, nonlinearities) with financial markets, geopolitical conditions, economic growth and technological developments. ML approaches are found to be particularly suitable for this. For instance, Zhao et al. (2017a) use a DL ensemble approach (stacked denoising autoencoders) to forecast crude oil prices. Comparing it with several competing conventional models (such as random walk and Markov regime-switching), the authors show that their approach provides superior forecasts as the approach is able to capture the complex relationship of energy prices with various exogenous variables. The DL algorithms used in this paper not only outperform conventional models, but have also been found to have superior forecasting power relative to several ML approaches (for example, support vector machine and feedforward neural network). Yu et al. (2014) argue that single models of crude oil price forecasting (such as support vector machines) are associated with problems of overfitting and sensitivity to parameters. Therefore, the authors use a novel hybrid learning paradigm that integrates compressed sensing-based denoising (CSD) with an AI-based prediction model (such as an artificial neural network). This approach reduces noise levels and generates superior forecasts compared to benchmark models. In a similar study, Costa et al. (2021) employ a range of ML models (random forest, quantile regression forest, XGBoost, elastic net, LASSO, ridge) and standard econometric models (ARIMA, vector error correction model, forecast combinations). Interestingly, the authors find that the forecasting accuracy of the VECM model is as good as that of ML methods such as adaptive LASSO and elastic net in the short term (one month). The LASSO family provides the best forecast for up to six months. Most of the studies focusing on oil price prediction use time-series data of the crude oil market and other exogenous variables (Fan et al., 2008; Shin et al., 2013). However, some other underlying factors, such as investor sentiment, may be potential drivers of the global crude oil price. This factor has received less attention in the literature due to the complexity associated with assessing market sentiment. Li et al. (2021b) contribute to this area by exploring the predictive ability of news sentiment for oil returns. The authors apply a natural language processing (NLP) technique to capture investor sentiment from crude oil news headlines and use a DL prediction model (bidirectional long short-term memory neural networks) to forecast oil prices. The model can capture both qualitative and quantitative inputs and generates superior forecasts compared to relevant econometric approaches (such as ARIMA, GARCH, etc.). The authors find that the oil price responds positively to positive news shocks and exhibits a weak negative response to negative news shocks. ML-based models for crude oil price forecasting are also used by Yu et al. (2008), Hao et al. (2020), and Li et al. (2019), among others. Although the energy price prediction literature predominantly focuses on oil price prediction, researchers also concentrate on predicting other commodity prices in the energy market, such as coal, natural gas, and electricity (Shao et al., 2020; Lago et al., 2018; Ferrari et al., 2021; Papadimitriou et al., 2014). This interest has been more pronounced in recent years with the deregulation of the power market and high penetration of renewable energy (Shao et al., 2020; Papadimitriou et al., 2014). In a recent study, Shao et al. (2020) use a novel Bayesian extreme learning machine (BELM)-based DL model to forecast short-term electricity clearing prices. The authors show that their approach delivers satisfactory forecasting results. Several other studies have used big data research methods in predicting electricity prices. For example, Lago et al. (2018) use several DL models to forecast Belgian electricity prices. Papadimitriou et al. (2014) apply support vector machines to predict one-day-ahead directional changes in electricity prices. Wang et al. (2016) apply a class of deep neural networks (stacked denoising autoencoder) for forecasting short-term electricity prices in the US market. All three studies conclude that ML and DL-based algorithms improve predictive performance. Ferrari et al. (2021) apply a novel dynamic factor model based on a penalized maximum likelihood approach to forecast energy commodity prices such as oil, gas and coal. While most previous studies report that ML-based methods generate superior forecasts, Ferrari et al.
(2021) show that their approach generates superior forecasts to commonly used ML approaches such as elastic net, LASSO and random forest. The superior forecasting ability is attributed to the model’s ability to capture the sparsity of the latent factor structure behind the data. While the literature predominantly emphasizes predicting energy prices and volatility, more recently we have come across empirical studies focusing on predicting energy demand. This line of research is motivated by the fact that a well-constructed forecasting model is important for a country’s energy policies as the model is likely to provide an estimate of the country’s
energy requirements. Beyca et al. (2019) use several ML techniques (such as support vector machines and artificial neural network approaches) to forecast natural gas consumption in the province of Istanbul. They find that SVR provides a more reliable and accurate forecast than the other approaches. Several other papers use ML-based techniques to forecast energy demand. For instance, Potočnik et al. (2019) use the kernel machine and the recurrent neural network to forecast gas consumption in Slovenia. Panapakidis and Dagoumas (2017) apply the genetic algorithm, adaptive neuro-fuzzy inference system, and feedforward neural network to forecast one-day-ahead gas demand. An artificial neural network is also used by Szoplik (2015) for predicting natural gas demand in Poland. All these studies report the superior predictive ability of ML and AI-based predictive models in terms of lower prediction error compared to other competing models. In recent years, corporations are being heavily pressured by different stakeholders (for example, investors, regulators, suppliers, and the community as a whole) to reduce their carbon footprint (Nguyen et al., 2021). This pressure comes in the form of carbon disclosures and resolutions at companies' annual general meetings (Flammer et al., 2021). Investors' focus on firms' carbon footprint arises as it reflects their climate transition risk, their negative contribution to the environment, and the potential cost of decarbonization. Therefore, predicting the firm-level carbon footprint is important for overall financial risk prediction. Motivated by this proposition, Nguyen et al. (2021) use ML (a combination of a meta elastic net learner and multiple base-learners) to predict the carbon footprint of over two thousand global firms. The ML approach generates up to 30 percent accuracy gain in relation to existing models. A similar result was obtained by several other researchers as they attempted to predict carbon emissions (see, for example, Xu et al., 2019).
Table 2.3  Big data research methods and energy finance

Citation | Big data research approach | Key advantages | Conventional model/approach
Li et al. (2021b) | Bidirectional long short-term memory neural network (BiLSTM) | Capture both qualitative and quantitative inputs; reliable and accurate results | Autoregressive integrated moving average (ARIMA); linear regression; support vector machine; uni-directional long short-term memory
Costa et al. (2021) | Regression trees (random forest, quantile regression forest, XGBoost); regularization procedures (elastic net, LASSO, ridge) | Superior forecast in the intermediate term (up to 6 months); avoid overfitting | ARIMA; vector error correction model; forecast combinations; OLS regression
Hao et al. (2020) | LASSO; ridge; elastic net | Improve out-of-sample predictive performance | –
Shao et al. (2020) | Bayesian extreme learning machine (BELM) classifier; minimum redundancy maximum relevance algorithm; multivariate sequence segmentation algorithm; BELM-based deep learning model | Detect abnormal fluctuations; avoid incorrect predictions resulting from unreliable statistical assumptions | OLS regression
Li et al. (2019) | Support vector machines optimized by genetic algorithm; backpropagation neural network optimized by genetic algorithm | Superior forecast | ARIMA; GARCH
Zhao et al. (2017a) | Deep learning ensemble; stacked denoising autoencoders | Capture complex relationship of energy price with exogenous variables; superior forecast | Random walk model; Markov regime-switching; support vector regression
Yu et al. (2014) | Artificial neural network: least squares support vector regression (LSSVR) | Reduce noise level; superior forecast; generalization performance; fewer free parameters | ARIMA; support vector machines; ARMA; GARCH
Papadimitriou et al. (2014) | Support vector machines | Captures nonlinearities | –
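Many of the comparisons summarized in Table 2.3 pit an ML learner against a naive random-walk benchmark. The toy example below does the same for a simulated price series: a support vector regression on a few lags is compared with a forecast that simply carries today's price forward. The series, lag length and model settings are arbitrary illustrative choices rather than a replication of any cited study.

```python
# Assumed, simulated example: one-step-ahead forecasts of a persistent "price"
# series from a lag-based SVR, compared with a random-walk (carry-forward) forecast.
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(4)
T = 600
price = np.zeros(T)
for t in range(1, T):                        # persistent, mildly nonlinear dynamics
    price[t] = 0.95 * price[t - 1] + 0.3 * np.sin(price[t - 1]) + rng.normal(0, 0.5)

lags = 5
X = np.column_stack([price[i:T - lags + i] for i in range(lags)])  # lagged prices
y = price[lags:]                                                   # next-period price

split = int(0.8 * len(y))
svr = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0))
svr.fit(X[:split], y[:split])

pred = svr.predict(X[split:])
rw = X[split:, -1]                           # random walk: tomorrow equals today
mse_svr = np.mean((y[split:] - pred) ** 2)
mse_rw = np.mean((y[split:] - rw) ** 2)
print(f"MSE SVR: {mse_svr:.3f}  MSE random walk: {mse_rw:.3f}")
```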
6. DIRECTIONS OF FUTURE RESEARCH
Despite the huge amount of literature already focusing on the predictive ability of big data research methods, it is clear that more research is needed in this area. In this section, we discuss some avenues for future research. The literature predominantly reports that ML-based approaches improve the prediction of financial asset returns, corporate failure, commodity prices, and so on, compared to the competing conventional approaches. However, these improved forecasts are simply measurements, and they do not provide economic mechanisms or equilibrium conditions. The commonly used ML methods are not designed to detect the fundamental association between asset prices and conditioning exogenous variables (Gu et al., 2020). However, ML can still be useful in understanding the potential economic mechanisms by restructuring the algorithms. Although some preliminary work has been done in this area (Kelly and Pruitt, 2013; Gu et al., 2021, Feng et al., 2020), this is still an exciting avenue of future research. As ML approaches result in more accurate measurement, risk premiums are associated with less approximation and estimation error. Therefore, identifying plausible economic mechanisms behind asset pricing is less challenging now. As explained previously, the success of ML approaches compared to competing conventional methods is mainly attributable to their ability to capture high-order interaction terms between variables (Mullainathan and Spiess, 2017) and nonlinear terms (Goldstein et al., 2021). This phenomenon can lead to future research developing new models to explain why
the economic impact of a variable is conditional on its association with another variable. The nonlinearity can also motivate new models. While ML helps improve the predictive ability of existing models, we also need new theoretical models to explain the new world of superior predictive ability of big data research methods. Thus far, empirical studies using ML and focusing on financial prediction predominantly attempt to understand human behaviour (Goldstein et al., 2021). A promising new line of research in financial prediction is investigating predictability when decision-makers are machines (for example, algorithmic trading, robo advisers). As interpretations of human behaviour from the psychology literature have been used to understand investors' actions in behavioural finance, algorithmic behaviour (or the psychology of machines) can be explored in future research to understand algorithmic behavioural finance. The application of ML and other big data research methods has become widespread in the business and investment community (Goldstein et al., 2021). Therefore, an interesting new line of research could be how investors and firms react to the technological revolution when they make real decisions. For instance, Cao et al. (2020) show that corporations have revised their 10-K and 10-Q filing formats to make them suitable for machine readers. However, many other questions are still unanswered. For example, do investors have less incentive to monitor market prices as more information sources are available? Are investment strategies, in general, less plausible now as market prices quickly aggregate more information due to big data and associated technological advancements? Future research can attempt to answer these questions. The availability of big data provides more information to both sophisticated players (for example, institutional investors and companies) and retail investors. However, the impact of big data may be asymmetric across different economic agents. For example, individual investors' use of social media may push asset prices further away from fundamentals than they would be otherwise (Chawla et al., 2016). On the other hand, sophisticated arbitragers can quickly take positions against the behavioural biases of retail investors. This interaction is likely to result in high volatility in the market. New research can explore the heterogeneous impact of big data research on different economic agents and their aggregate impact on the market. As explained previously, finance researchers are increasingly using large datasets for various financial predictions. For example, Sirignano (2019) uses second-by-second stock trade data. Several papers rely on natural language processing to derive information from unstructured data such as texts (Renault, 2017; Gentzkow et al., 2019; Azimi and Agrawal, 2021). Buehlmaier and Zechner (2021) utilize the information in financial media to predict stock returns, and Obaid and Pukthuanthong (2021) use photos to generate investor sentiment. Other streams of finance research also use complex datasets. For instance, Li et al. (2021a) use transcripts of earnings calls to measure corporate culture. Future research can utilize more complex datasets (for example, audio of corporate conference calls and satellite images of corporate site visits) to understand economic activities and decisions that simpler data cannot capture.
The review of the relevant literature demonstrates that ML approaches lead to an increase in predictive accuracy, and this superior forecast emanates from the improved use of information by ML technologies. However, one concern is that the adoption of new technologies may result in an unequal distribution of outcomes (for example, lending decisions) across societally important categories (such as race, age or gender), even as it delivers improved forecasts overall. Fuster et al. (2022) provide preliminary evidence on the distributional impact
of innovations in statistical technology in the US mortgage market. They show that Black and Hispanic borrowers gain less from the ML-based credit-screening system. However, more research needs to be done to explore the impact of new technologies on outcome distributions across financial markets. Most studies focusing on the prediction of loan default use big data approaches and large volumes of account-level, credit bureau, and aggregate macroeconomic data. In practice, lending institutions would segment their loan portfolios based on distinct categories (for example, prime versus subprime borrowers) and estimate separate models for different segments (Butaru et al., 2016). Such segmentation may result in a model more tailored to a particular segment and greater predictive accuracy. However, empirical studies hardly perform such segmentation. Future research may use ML tools (for example, decision tree models) to facilitate the segmentation process and ultimately produce a more accurate and tailored predictive model.
7. CONCLUSION
"Big data" is substantially transforming the financial services industry and significantly redesigning finance research. Although academic research in finance has started using big data and associated research methods (Li et al., 2021a; Anand et al., 2021; Aziz et al., 2022), the literature still needs to understand the benefits of big data research methods compared to traditional financial modelling and whether big data research methods can answer novel research questions in a novel way. In this chapter, we provide a systematic review of big data research methods used in finance with a particular focus on financial predictability, identify the superiority of these approaches compared to traditional predictive methods, and provide directions for future research using big data research approaches in financial forecasting. Reviewing the literature, it is found that a range of ML and AI-based algorithms have been used in predicting asset returns, financial distress, loan default, and energy prices. These algorithms are found to provide superior forecasts (with very few exceptions) and larger economic gains for investors. The increase in predictive ability predominantly comes from these models' ability to incorporate a large number of variables, capture complex interactions among the variables, and reflect nonlinear dependencies. ML-based technologies typically outperform conventional models (such as OLS-based regressions) in terms of predictive accuracy. Although ML technologies are already widespread in the financial prediction literature, it is clear that more research needs to be done in this area. For instance, ML can be used to detect plausible economic mechanisms behind asset pricing, and new models need to be developed to explain the complex interactions and nonlinearities in the predictive relationships. Future research can also highlight algorithmic behaviour (or the psychology of machines) and explore the asymmetric impact of big data on different economic agents. Additionally, future research can utilize more complex datasets (for example, audio and satellite images) for financial prediction.
REFERENCES Adämmer, P. and Schüssler, R.A. (2020). Forecasting the equity premium: Mind the news! Review of Finance, 24(6), 1313–55. Altman, E.I. (1968). Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. Journal of Finance, 23(4), 589–609. Altman, E.I., Marco, G. and Varetto, F. (1994). Corporate distress diagnosis: Comparisons using linear discriminant analysis and neural networks (the Italian experience). Journal of Banking & Finance, 18(3), 505–29. Anand, A., Samadi, M., Sokobin, J. and Venkataraman, K. (2021). Institutional order handling and broker-affiliated trading venues. Review of Financial Studies, 34(7), 3364–402. Ari, A., Chen, S. and Ratnovski, L. (2021). The dynamics of nonperforming loans during banking crises: A new database with post-COVID-19 implications. Journal of Banking & Finance, 106140. Athey, S. (2019). The impact of machine learning on economics. In A. Agrawal, J. Gans and A. Goldfarb (eds), The Economics of Artificial Intelligence (pp. 507–52). Chicago, IL: University of Chicago Press. Azimi, M. and Agrawal, A. (2021). Is positive sentiment in corporate annual reports informative? Evidence from deep learning. The Review of Asset Pricing Studies, 11(4), 762–805. Aziz, S., Dowling, M., Hammami, H. and Piepenbrink, A. (2022). Machine learning in finance: A topic modeling approach. European Financial Management, 1–27. Baker, M. and Wurgler, J. (2006). Investor sentiment and the cross‐section of stock returns. Journal of Finance, 61(4), 1645–80. Bastos, J.A. (2010). Forecasting bank loans loss-given-default. Journal of Banking & Finance, 34(10), 2510–17. Bauer, J. and Agarwal, V. (2014). Are hazard models superior to traditional bankruptcy prediction approaches? A comprehensive test. Journal of Banking & Finance, 40, 432–42. Baumeister, C. and Kilian, L. (2012). Real-time forecasts of the real price of oil. Journal of Business & Economic Statistics, 30(2), 326–36. Baumeister, C. and Kilian, L. (2015). Forecasting the real price of oil in a changing world: A forecast combination approach. Journal of Business & Economic Statistics, 33(3), 338–51. Bellotti, T. and Crook, J. (2012). Loss given default models incorporating macroeconomic variables for credit cards. International Journal of Forecasting, 28(1), 171–82. Bellovary, J.L., Giacomino, D.E. and Akers, M.D. (2007). A review of bankruptcy prediction studies: 1930 to present. Journal of Financial Education, 33, 1–42. Beyca, O.F., Ervural, B.C., Tatoglu, E., Ozuyar, P.G. and Zaim, S. (2019). Using machine learning tools for forecasting natural gas consumption in the province of Istanbul. Energy Economics, 80, 937–49. Bianchi, D., Büchner, M. and Tamoni, A. (2021). Bond risk premiums with machine learning. The Review of Financial Studies, 34(2), 1046–89. Buehlmaier, M.M. and Zechner, J. (2021). Financial media, price discovery, and merger arbitrage. Review of Finance, 25(4), 997–1046. Butaru, F., Chen, Q., Clark, B., Das, S., Lo, A.W. and Siddique, A. (2016). Risk and risk management in the credit card industry. Journal of Banking & Finance, 72, 218–39. Cao, S., Jiang, W., Yang, B. and Zhang, A.L. (2020). How to Talk When a Machine is Listening: Corporate Disclosure in the Age of AI (No. w27950). National Bureau of Economic Research. Carmona, P., Climent, F. and Momparler, A. (2019). Predicting failure in the US banking sector: An extreme gradient boosting approach. International Review of Economics & Finance, 61, 304–23. Chawla, N., Da, Z., Xu, J. and Ye, M. (2016). 
3. Big data, data analytics and artificial intelligence in accounting: an overview
Sudipta Bose, Sajal Kumar Dey and Swadip Bhattacharjee
1. INTRODUCTION
Robotisation is rapidly influencing the way in which humans interact, how work gets done, and which tasks can be automated (Chartered Institute of Management Accountants (CIMA), 2022). With the advent of big data, data analytics, and artificial intelligence (AI), accounting and finance professionals can take advantage of several opportunities in this rapidly evolving, disruptive, but advantageous environment (CIMA, 2022).

The total amount of data created, captured, reproduced and consumed worldwide is growing rapidly, reaching 64.2 zettabytes, or 64.2 trillion gigabytes, in 2020 (See, 2021). Additionally, global data generation is expected to exceed 180 zettabytes over the next five years, peaking in 2025 (See, 2021). This data can be captured and analysed to provide valuable insights that support future business growth. Consequently, businesses need to employ tools that help them convert data into usable information, which requires the use of data analytics tools. Data analytics is the process of identifying patterns and trends in past raw data (often referred to as big data) to predict future events and assist strategic decision-making, while artificial intelligence (AI) entails data processing, making assumptions, and attempting to make predictions that are beyond human capabilities.

Although the term "Big Data" is relatively new to the business world, it is already routinely used in practically every facet of human activity (Vasarhelyi et al., 2015). The fundamental reason for big data's popularity is that recent advances in information technology, especially the Internet, have made available an exponentially growing amount of information (Vasarhelyi et al., 2015). More specifically, big data is widely recognised as the next frontier for innovation, competition and efficiency (Manyika et al., 2011). In line with this notion, the McKinsey Global Institute (2012) conducted a survey and found that 51 per cent of global business leaders believe that big data and data analytics are top priorities in their current business functions. Furthermore, the Chartered Global Management Accountant (CGMA) (2013) surveyed more than 2000 Chief Financial Officers (CFOs) and finance professionals around the world and concluded that big data would revolutionise the way businesses operate over the next decade.

Accounting professionals have an important role to play in big data and data analytics since accounting deals with the recording, information processing, measurement, analysis, and reporting of financial information (Liu and Vasarhelyi, 2014). Accounting practitioners around the world have emphasised the value of big data in the accounting and finance fields. For example, the Association of Chartered Certified Accountants (ACCA) and the Institute of Management Accountants (IMA) (2013a) contend that big data, cloud, mobile, and social platforms are changing the landscape for accounting and finance professionals, and they must
adapt to the challenges posed by cybercrime, digital service delivery, and artificial intelligence. Similarly, the CGMA (2013) stresses the significance of big data, arguing that it raises significant challenges for the future role of accounting and finance. Accountants, who specialise in providing financial accounts to report on past performance, may be sidelined if they do not embrace this change and appreciate the new technologies (CIMA, 2022). Alternatively, they might grasp the opportunity to become big data champions as a source of evidence to support decision-making and help reinvent the way business is done (CGMA, 2013).

With rapid advances being made in technology, accounting skills have shifted from using pencil and paper to typewriters, calculators, and eventually spreadsheets and accounting software (Poddar, 2021). Data analytics in accounting is a relatively new skill set that is growing significantly in all areas of accounting. The value of data analytics in accounting and finance has grown over time and has in fact transformed task processes, particularly those that provide inferences, predictions, estimations, and assurance to decision-makers and information users (Austin et al., 2021). Consequently, accounting scholars, researchers, and practitioners worldwide view data analytics as a valuable tool for gaining new insights into business financials, identifying areas for process improvement, and reducing risk.

Nowadays, accounting professionals globally provide significant support to their senior executives by analysing and interpreting the large volume of accounting data, most of which comes from non-traditional accounting systems (Siegel, 2013; Haverson, 2014; Davenport and Harris, 2017). Web server logs, Internet and mobile phone clickstream recordings, social media, and a large number of machine-generated and sensor-based systems are used to extract financial and non-financial information that is then analysed and interpreted to make important business decisions (Bertolucci, 2013; Löffler and Tschiesner, 2013). Added to this, several governments around the world have taken the initiative to improve third-party access to both internal government data and data collected from citizens and businesses as a key measure to improve administrative procedures, which will create more opportunities for data analytics (Casselman, 2015; Office of Management and Budget, 2015).

Furthermore, the use of artificial intelligence (AI) technologies (for example, computer vision, natural language processing, speech recognition and machine learning) enables companies to improve their efficiency and obtain valuable insights about their customers and employees to develop more competitive customer and personnel strategies (Ernst & Young, 2020). Accounting professionals are playing a more strategic role in the field of AI technology. Since the advent of big data, data analytics, and AI, accounting professionals have demonstrated their ability to deliver greater value to businesses by generating higher revenues and streamlining processes.

The remainder of the chapter is organised as follows. Section 2 discusses big data in accounting. Section 3 describes data analytics in accounting, including the role of data analytics in financial accounting, auditing, and management accounting. Section 4 provides an overview of AI in accounting.
The final section (section 5) concludes the chapter.
2. BIG DATA IN ACCOUNTING
The term "Big Data" is defined as a huge dimension of unstructured and structured data derived from multiple sources (Ernst & Young, 2014). Structured data refers to the highly organised information stored in relational databases or spreadsheets.
In contrast, unstructured data refers to data from sources that are not highly organised (for example, photos, videos, blogs, presentations, social media posts, satellite imagery, open-ended survey responses, and website content), which together generate around 85 per cent of the world's information today (Mills et al., 2012). The other data type is semi-structured data, which contains both structured (highly organised) and unstructured (not highly organised) elements (for example, emails, zipped files).

The term "Big Data" has become a buzzword in the accounting profession in recent years, like other trending topics such as blockchain, AI, and machine learning (Boomer, 2018). Although there is no internationally accepted definition of big data, Gartner (2012) defines big data as "high-volume, high-velocity, and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process optimization". Big data is thus characterised by these "three Vs", namely volume, velocity, and variety (ACCA and IMA, 2013b). Volume indicates the massive size of datasets (for example, Facebook, Google, Yahoo, blogs, census records), velocity indicates the speed at which the data is generated (for example, stock price data that are generated very rapidly), while variety is the collection of data from different sources (both structured and unstructured) (Cao et al., 2015). Two further characteristics of big data are veracity and value. Veracity indicates the accuracy and reliability of the data, while value focuses on the costs and benefits of data collection (Zhang et al., 2015; Merritt-Holmes, 2016; Janvrin and Watson, 2017). Furthermore, according to Zhang et al. (2015), big data includes four different elements, namely high volume, high velocity, high variety and high veracity.

The application of continuous auditing now largely depends on automated real-time data analysis due to the large volume and high velocity of data (Vasarhelyi et al., 2010). However, high volume and high velocity create gaps between modern audit analytics and the demand for big data analytics. The huge variety and high veracity also generate new challenges because current auditing methods have only limited competencies to deal with big data. Although big data encompasses massive veracity, huge volume, and substantial computing power for the collection and processing of information, it cannot be stored, processed and analysed using old-fashioned methods (Gartner, 2012).

Big data cannot be used by businesses without a systematic analysis being undertaken. However, once it has been efficiently scrutinised, cleaned, transformed, and properly interpreted, big data generates valuable insights (Cao et al., 2015). Furthermore, efficient analysis and appropriate interpretation of this data will speed up the revenue generation process, help businesses understand customers' expectations and provide essential information to interested and relevant users (Trkman et al., 2010; Davenport and Harris, 2017). When it comes to accounting, the purpose of big data is to collect, organise and use data from a range of sources to gain new business insights in real time.
For example, accounting and financial analysts can access real-time data from anywhere with a network connection instead of relying on monthly financial reports.1 Although big data has several implications for business, particularly in terms of accounting software, financial decision-making, analysis of customer consumption patterns, and banking (Davenport and Harris, 2017), the importance of big data varies from organisation to organisation. For example, what a small accounting and tax consultancy firm considers big data would not be considered big by the Big 4 accounting firms (Vasarhelyi et al., 2015).
1. See https://online.maryville.edu/blog/data-analytics-in-accounting/ (accessed on 8 March 2022).
Likewise, the big data originating from the Big 4 firms is unlikely to be considered that big compared to NASA (Vasarhelyi et al., 2015). Whether a given amount of data is big or not depends on whether that data exceeds the capabilities of the information systems that work with it. Thus, storage and processing are considered to be the two measures of big data competencies (Vasarhelyi et al., 2015).

Despite the fact that big data has become an emerging buzzword for accounting professionals in recent years, the impetus of accounting has always been to provide useful information to interested users (Capriotti, 2014). Although the main goal of accounting professionals is to provide information from a large volume of business records for decision-makers to consider, accounting information originates from diverse sources such as paper-based systems, legacy-based systems, and highly technical business systems (Janvrin and Watson, 2017). Through the implementation of various software and analytical tools, accounting professionals identify, record, summarise, analyse and report financial information for their internal and external users. Both internal and external auditors implement a variety of automated techniques (for example, generalised auditing software) to review accounting information to ensure that managers prepare financial statements in accordance with relevant accounting standards and applicable laws.

In this way, big data has changed the practice of measuring business transactions and ensuring their relevance. Additionally, big data offers companies the ability to capture transactions before their formal accounting entry, identify inventory movements before their actual receipt or delivery, identify customer calls before actual service activities are performed, and identify economic activities in many other ways (Vasarhelyi et al., 2015). The measurement system for accounting transactions has been drastically changed by new technology-driven changes and the advent of Enterprise Resource Planning (ERP) systems that speed up the process of capturing data and improve data processing systems (Romero et al., 2012). Although no changes are visible in accounting practices and standards such as International Accounting Standards (IAS) and International Financial Reporting Standards (IFRS), big data is the main cause of a paradigm shift that makes it possible to identify and address business functions earlier (Vasarhelyi et al., 2015). Table 3.1 provides a list of several opportunities and challenges in implementing big data in the accounting and finance profession, as identified by the ACCA and the IMA (2013b).
Table 3.1  Opportunities and challenges of big data for the accounting and finance profession

Area: Valuation of data
Opportunities:
● Helping companies value their data assets through the development of robust valuation methodologies.
● Increasing the value of data through stewardship and quality control.
Challenges:
● Big data can quickly "decay" in value as new data assets become available.
● The value of data varies according to its use.
● Uncertainty about future developments in regulation, global governance, and privacy rights and what they might mean for data value.

Area: Use of big data in decision making
Opportunities:
● Using big data to offer more specialised decision-making support in real time.
● Working in partnership with other departments to calculate the points at which big data can most usefully be shared with internal and external stakeholders.
Challenges:
● Self-service and automation could erode the need for standard internal reporting.
● Cultural barriers might obstruct data sharing between silos and across organizational boundaries.

Area: Use of big data in the management of risk
Opportunities:
● Expanding the data resources used in risk forecasting to see the bigger picture.
● Identifying risks in real-time for fraud detection and forensic accounting.
● Using predictive analytics to test the risk of longer-term investment opportunities in new markets and products.
Challenges:
● Ensuring that correlation is not confused with causation when using diverse data sources and big data analytics to identify risks.
● Predictive analytic techniques will mean changes to budgeting and return on investment calculations.
● Finding ways to factor failure-based learning from rapid experimentation techniques into processes, budgets and capital allocation.

Source: ACCA and IMA (2013b).
3. DATA ANALYTICS IN ACCOUNTING
Data analytics can be defined as "processes by which insights are extracted from operational, financial, and other forms of electronic data internal or external to the organization" (KPMG, 2016). There are a variety of ways in which these insights can be derived, including historical, real-time or predictive. They can be risk-focused (for example, control effectiveness, fraud, waste, abuse, non-compliance with policy/regulation) or performance-focused (for example, increased sales, decreased costs, improved profitability) and frequently provide "how?" and "why?" answers to the initial "what?" queries commonly found in the information retrieved from the data (KPMG, 2016). Three distinct characteristics of data analytics are the data itself, the analytics applied to it, and the presentation of results in a way that enables commercial value to be generated (Gantz and Reinsel, 2012). More generally, data analytics encompasses not only the collecting and managing of data but also the visualising and presenting of data using tools, infrastructure, and methods to gain insights from big data (Mikalef et al., 2015). Therefore, data analytics is a systematic process of investigating structured and unstructured data through various techniques such as statistical and quantitative analysis, as well as explanatory and extrapolative models, to generate useful information for accounting decision-makers.
Accounting has evolved in tandem with changing technology, from the use of pencil and paper to typewriters, calculators, spreadsheets, and accounting software (Poddar, 2021). Most accountants have the ability to analyse data as they are experienced in recording and analysing transactions. In addition, they are well trained to document accounting information, are familiar with financial statements, and have sufficient experience in various aspects of business decisions, making them experienced, trusted advisors to businesses (Haverson, 2014). Therefore, technical skills, analytical thinking, and problem-solving competence have long been a part of the accounting profession. However, accounting data analytics is a relatively new skill set that is spreading rapidly in the accounting profession.

The importance of data analytics has increased over time as accounting professionals implement data analytics tools to draw inferences from, make predictions about, and provide assurance over business data in order to make useful decisions. For example, corporate executives and senior-level management are adopting data analytics to identify and infer operational inefficiencies (Dai et al., 2019). In addition, data analytics is an important area of investment for public accounting firms, particularly in consulting, tax advisory and auditing services (Earley, 2015). Tax professionals use data analytics to detect tax fraud and forecast future tax liabilities (DaBruzzo et al., 2013). In addition, data analytics is now widely implemented by auditors to identify uncertain, ambiguous, or possibly fraudulent transactions (Vasarhelyi, 2013; Verver and Grimm, 2013; Brown-Liburd et al., 2015), and to understand and assess the safety and control of a client's huge volume of datasets (Ernst & Young, 2014; Vasarhelyi et al., 2015).

Apart from the demand for data analytics by accounting professionals, accounting scholars are incorporating data analytics into their accounting curricula. The Association to Advance Collegiate Schools of Business (AACSB)'s demand for data analytics as a component of an accounting degree demonstrates the relevance of data analytics, given that graduates must possess abilities in creating, distributing, assessing, and interpreting data (Schneider et al., 2015). Additionally, the Pathways Commission on Accounting Higher Education (Behn et al., 2012), funded by the American Accounting Association (AAA) and the American Institute of Certified Public Accountants (AICPA), inspires the AACSB to introduce new accounting program accreditation standards. These include accounting students' learning experiences including "data creation, data sharing, data analytics, data mining, data reporting, and storage within and across organizations" (AACSB, 2013).

Furthermore, businesses implement data analytics tools in their day-to-day functions to increase their profits and reduce costs/overheads in various ways outside of accounting. For example, customer analytics is used in marketing to find and understand consumer purchasing habits and other behavioural patterns so that market trends can be predicted and new opportunities forecast. Algorithmic trading is used to speed up current stock price monitoring systems. Although unstructured data has traditionally seen little use in business analysis, the integration of data analytics accelerates the process of using unstructured data to boost the timeliness of business processes.2
2. See https://online.maryville.edu/blog/data-analytics-in-accounting/ (accessed on 8 March 2022).
3.1 Types of Data Analytics
There are four types of data analytics: descriptive analytics, diagnostic analytics, predictive analytics, and prescriptive analytics. They are summarised in Table 3.2.

Table 3.2  Types of data analytics

Descriptive analytics
Explanation: Provides insight based on past information. What is happening?
Examples: Used in standard report generation and in basic spreadsheet functions such as counts, sums, averages and percent changes, and in vertical and horizontal analyses of financial statements.

Diagnostic analytics
Explanation: Examines the cause of past results. Why did it happen?
Examples: Used in variance analyses and interactive dashboards to examine the causes of past outcomes.

Predictive analytics
Explanation: Assists in understanding the future and provides foresight by identifying patterns in historical data. What will happen? When and why?
Examples: Can be used to predict an accounts receivable balance and collection period for each customer and to develop models with indicators that prevent control failures.

Prescriptive analytics
Explanation: Assists in identifying the best option to choose to achieve the desired outcome through optimisation techniques and machine learning. What should we do?
Examples: Used in identifying actions to reduce the collection period of accounts receivable and to optimise the use of payable discounts.

Source: Tschakert et al. (2016).
Data analytics is often misinterpreted as just descriptive analysis ("what is"). However, what is really valuable is predictive ("what is going to happen") and prescriptive analysis ("what should we do?") rather than descriptive analysis alone (Tschakert et al., 2016). Companies and industries rely heavily on data analytics to take competitive advantage of technological innovations. Thus, regulators, external capital providers, and capital market participants consider the availability of data and its efficient analysis (Tschakert et al., 2016).

Descriptive analytics focuses on what happened in the past. The term "past" refers to any point in time when an event occurred, which could be a month ago or only a minute ago. Today, 90 per cent of companies employ descriptive analytics, the most fundamental type of analytics. This sort of analytics examines both real-time and historical data to provide insights into the past rather than establishing a cause-and-effect relationship between events (Tschakert et al., 2016). Google Analytics is a prominent example of descriptive analytics in action: it provides a concise overview of website activity, such as the number of visits in a certain period or the source of the visitors. Other business applications of descriptive analytics include sales revenue results coupled with purchases, cost per customer, customer credit risk, inventory measurement and accessibility, Key Performance Indicator (KPI) dashboards, and monthly revenue reports (Tschakert et al., 2016).

Diagnostic analytics aims to further explore the cause of an event. Diagnostic analysis delves into descriptive analytics data to ascertain the underlying reasons for outcomes. For instance, if descriptive analytics indicates that sales decreased by 20 per cent in July, it is necessary to determine why, and the logical next step is to apply diagnostic analytics. This form of analytics is used by businesses because it connects more data and identifies patterns of activity. For example, a freight company can employ diagnostic analytics to determine the cause of sluggish shipments in a particular region.
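To make the distinction between the descriptive and diagnostic steps concrete, the short Python sketch below first summarises hypothetical monthly sales (descriptive: what happened) and then drills down by region (diagnostic: why it happened). The column names and figures are invented for illustration only and are not drawn from the chapter.

    # Descriptive and diagnostic analytics on hypothetical monthly sales data.
    import pandas as pd

    sales = pd.DataFrame({
        "month":   ["2023-06", "2023-06", "2023-07", "2023-07"],
        "region":  ["North", "South", "North", "South"],
        "revenue": [120_000, 100_000, 115_000, 61_000],
    })

    # Descriptive analytics: total revenue per month and month-on-month change.
    monthly = sales.groupby("month")["revenue"].sum()
    print(monthly)
    print(monthly.pct_change())        # shows roughly a 20% decline in July

    # Diagnostic analytics: drill down by region to locate the cause.
    by_region = sales.pivot_table(index="month", columns="region",
                                  values="revenue", aggfunc="sum")
    print(by_region.pct_change())      # the drop is concentrated in the South

The same pattern (aggregate first, then disaggregate along a candidate explanatory dimension) generalises to costs, shipments or any other operational measure.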
Big data, data analytics and artificial intelligence in accounting 39 The knowledge gathered from descriptive analytics is applied to predictive analytics (Appelbaum et al., 2017) and seeks to answer the question, “what is likely to happen?” This type of analytics uses past data to make predictions about future events. Therefore, forecasting is the core of predictive analytics. Advanced technology and manpower are required to forecast this analysis, which is based on statistical modelling. Sales forecasting, risk assessment, and customer segmentation to determine customer profitability are examples of commercial applications for predictive analytics (Appelbaum et al., 2017). Prescriptive analytics is at the forefront of data analytics, incorporating insights from previous analyses to determine the best course of action to take in response to a current problem or decision. The prescriptive model uses information about what happened, why it happened, and a variety of “what-might-happen” evaluations to assist the users in selecting the best course of action to take to avoid a future problem. Prescriptive analytics simplifies the implementation and management of sophisticated tools and technologies such as algorithms, business rules and machine learning. AI is an excellent example of predictive analytics. AI systems absorb a significant amount of data in order to learn and make intelligent decisions, and well-designed AI systems are able to communicate and even react to their decisions. Prescriptive analytics and AI are currently being used by most large data-driven organisations (for example, Apple, Facebook, Netflix, and others) to improve decision-making. 3.2
3.2 Importance of Data Analytics in Accounting
Data analytics is undoubtedly one of the most transformative technological breakthroughs to have impacted the accounting profession over the past few decades (Schmidt et al., 2020). By integrating accounting data analytics, companies can make efficient business decisions and meet the expectations of external capital providers and capital market participants. The importance of data analytics in boosting a company's performance is well recognised (Wixom et al., 2013). Accounting data analytics helps companies confirm that the business is operating efficiently; for example, healthcare organisations can use accounting data analytics to reduce costs (that is, less waste and fraud) while improving the quality of care (that is, safety and efficacy of treatment) (Srinivasan and Arunasalam, 2013). In addition, data analytics helps accountants to track the performance of organisations and take the necessary actions when they detect any deviations. This data-analytical evaluation is important for the long-term feasibility and existence of a company.

Accounting data analytics opens up new potential for accountants to offer additional value-added services to their clients; for example, auditors can provide more precise recommendations with a smaller margin of error by continuously monitoring larger datasets, and tax accountants can leverage data science to quickly examine difficult tax questions and so improve users' experience. This can help in attracting new customers and increasing the percentage of customers who remain loyal to the company over time.

The output of data analytics often includes sensitive information that raises concerns about confidentiality or privacy (Schneider et al., 2015). As a result, data misuse can exacerbate potential risks for businesses from various sources, both internal and external. These risks are well known to accounting executives, who are also well trained in how to deal with them. Accounting data analytics can support the identification of the business risks a company faces and the application of predictive data analytics to make efficient business decisions regarding specific business risks.
Accounting data analytics can be used to uncover the behavioural patterns of customers. These patterns can help companies to develop analytical models that can be used to discover investment opportunities and increase a company's profit margins (Poddar, 2021). Therefore, accounting data analytics can enhance a company's profit margin and help maximise the wealth of its owners.

To maintain the highest level of financial viability, every organisation should regularly evaluate cash flow and optimisation opportunities. Many businesses were unable to generate cash during the COVID-19 emergency due to forced business closures, lockdowns, stay-at-home directives, and widespread fear of virus transmission (Clayton and McKervey, 2020). While these events are difficult to predict, a clear cash flow picture can help mitigate the suffering associated with such disruptions. Businesses can use data analytics to better understand and manage their sales, inventories, receivables, and client segmentation, which are especially critical during the recovery (Clayton and McKervey, 2020). Data analytics thus provides detailed insights into the sources and uses of cash and makes it possible to assess the health of both ends of the supply chain.
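One common way such cash-flow insight is produced in practice is an accounts receivable ageing analysis. The sketch below groups hypothetical open invoices into ageing buckets with pandas; the invoice data and bucket boundaries are assumptions chosen for illustration, not figures from the chapter.

    # Accounts receivable ageing: bucket open invoices by days outstanding.
    import pandas as pd

    invoices = pd.DataFrame({
        "customer": ["A", "B", "C", "D", "E"],
        "amount": [5_000, 12_000, 7_500, 3_000, 9_000],
        "days_outstanding": [15, 42, 75, 10, 120],
    })

    buckets = pd.cut(invoices["days_outstanding"],
                     bins=[0, 30, 60, 90, float("inf")],
                     labels=["0-30", "31-60", "61-90", "90+"])

    ageing = invoices.groupby(buckets, observed=False)["amount"].sum()
    print(ageing)   # how much cash is tied up in each ageing bucket

The resulting buckets show at a glance how much cash is locked in overdue receivables, which is exactly the kind of picture that supports collection decisions during a recovery.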
3.3 Emerging Accounting Technologies
Several emerging data analytics technologies can support a firm's auditing and accounting processes. Explanations of these approaches are documented below.

3.3.1 Deep learning
Deep learning is an emerging AI technique for analysing large amounts of data to uncover complex and abstract patterns hidden within the raw data (Sun and Vasarhelyi, 2018). Deep learning reveals the deeper structure of events and situations across several layers of neural networks by combining the information with more advanced methods (Poddar, 2021). For example, existing data can be used to generate an automated algorithm for specific audit judgements, such as lease categorisation or bad debt calculation (Poddar, 2021). Several companies across the world are pursuing deep learning projects in their research centres (for instance, IBM with its Watson platform), and renowned accounting firms have invested a significant amount of money in deep learning and AI. Deep learning assists in decision-making throughout the audit process, including planning, internal control review, substantive testing, and completion (Sun and Vasarhelyi, 2018). Companies now understand the importance of deep learning in accounting data analytics so that they can use it in their accounting and auditing processes.
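As a toy illustration of the idea (not a production audit tool, and much shallower than a true deep learning model), the sketch below trains a small neural network with scikit-learn to categorise leases as finance or operating from two made-up features. The features, data and labels are assumptions introduced purely for illustration.

    # Toy neural-network sketch: categorising leases (0 = operating, 1 = finance)
    # from hypothetical features. Real audit models would use far richer data.
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    # Features: [lease term / asset useful life, present value of payments / fair value]
    X = np.array([[0.9, 0.95], [0.2, 0.30], [0.8, 0.90], [0.3, 0.25],
                  [0.95, 0.99], [0.1, 0.20], [0.7, 0.85], [0.4, 0.35]])
    y = np.array([1, 0, 1, 0, 1, 0, 1, 0])

    clf = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
    clf.fit(X, y)

    print(clf.predict([[0.85, 0.92]]))   # likely flagged as a finance lease

In practice a deep learning model for such an audit judgement would be trained on far larger engagement datasets and its output would support, not replace, the auditor's own assessment.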
3.3.2 Blockchain technology
Blockchain is a decentralised information and accounting system that allows for the control and validation of payment transactions while avoiding currency duplication or digital multiplication (Abad-Segura et al., 2021). Using blockchain technology, accounting data can be securely stored, instantly shared, and validated by anybody with an interest in the matter (Dai and Vasarhelyi, 2017). Blockchain can serve as an alternative ledger system for accounting records (Coyne and McMickle, 2017) and can help advance accounting information from a double-entry system to a triple-entry system (Abad-Segura et al., 2021). Blockchains can also be used to store programs that run only when predetermined conditions are met; these programs are known as smart contracts (Poddar, 2021). Smart contracts have several benefits. For example, the auditor and the company may agree that if an outlier reaches 100 per cent of the median value of transactions, it is time to evaluate the data using the human eye (Poddar, 2021). Blockchain is therefore likely to be used to identify such outliers and direct them to the auditors.

3.3.3 Predictive analytics
Predictive analytics is an advanced analytical tool that can be used to identify real-time insights and predict future events by analysing historical data (Poddar, 2021). Predictive analytics allows organisations to foresee the future, predict outcomes, uncover opportunities and hidden threats, and take quick action to run their business and make insightful future investment decisions. Thus, predictive analytics has great potential to support businesses significantly.
3.4 Tools of Accounting Data Analytics
A company can use numerous tools to assess its financial performance and position from several angles. Organisations can process their data using the following accounting data analytics tools.

3.4.1 Microsoft Excel
Microsoft Excel is a spreadsheet application available for Windows, macOS, Android and iOS and is widely used by businesses worldwide. It has a wide range of features, such as data calculation, summarising numbers, pivot tables and graphing tools, and it can perform statistical analyses such as regression modelling. Microsoft Excel remains one of the most significant and robust data analytics tools in the market, and it enhances the efficiency and effectiveness with which users can meet their analysis needs.

3.4.2 Business intelligence tools
Accounting professionals might benefit from business intelligence tools that help them to identify sustainable and predictive insights from a particular dataset. By using a variety of business intelligence tools, a company can clean its data, model the data and create easy-to-understand visualisations (Poddar, 2021). This visualisation provides detailed understanding and helps identify areas that require further development. Such tools generate shared outputs that can be easily accessed and understood by other members of the group. There are various business intelligence tools, such as Datapine, Tableau, Power BI, SAS Business Intelligence, Oracle Business Intelligence, Zoho Analytics, and GoodData.

3.4.3 Proprietary tools
A proprietary tool is a tool that is devised by a company for its own use. A company develops and utilises such a tool internally to produce and sell products and goods/services to its users and customers. Large companies usually introduce proprietary tools such as Interactive Data Extraction and Analysis (IDEA), a software application that allows accountants, auditors, and finance professionals to interact with data files.
3.4.4 Machine learning tools
Machine learning is a data analysis technique in which a software model is trained using data. It is a field of artificial intelligence based on the premise that systems can learn from the training data, identify patterns, and make judgements with minimal human interaction. Several companies across the world use advanced and sophisticated tools such as "R" and "Python" in their accounting data analytics procedures. These programming languages are mostly employed by companies to perform highly customised and advanced statistical analyses. Python is one of the fastest-growing programming languages available today. Python was originally developed as an object-oriented programming language for use in software and web development but was later extended for use in data research. Python can perform a wide range of analyses on its own and integrate with third-party machine learning and data visualisation software. On the other hand, R is a popular statistical programming language that statisticians use for statistical analysis, big data analytics, and machine learning. Facebook, Uber, Google, and Twitter use R for behavioural analysis, advertising effectiveness, data visualisation and economic forecasting. Both of these programming languages are used to build algorithms that perform regression analysis, detect data clusters and perform other analytical tasks.
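As a small example of the kind of clustering mentioned here, the sketch below uses Python's scikit-learn to cluster hypothetical vendor spend data so that unusual spending patterns stand out. The data, feature choice and number of clusters are assumptions made solely for illustration.

    # Clustering hypothetical vendor spend to spot unusual spending patterns.
    import numpy as np
    from sklearn.cluster import KMeans

    # Per-vendor features: [total annual spend (thousands), number of invoices]
    vendors = np.array([[500, 120], [480, 110], [30, 4], [520, 130],
                        [25, 3], [610, 150], [900, 6], [28, 5]])

    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(vendors)
    print(kmeans.labels_)   # the vendor with very large spend over very few
                            # invoices tends to land in a cluster of its own,
                            # which makes it an obvious candidate for review

The same script could equally well be written in R; the point is that both languages let accountants move from generic ratios to pattern-detection techniques such as clustering with only a few lines of code.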
3.5 Data Analytics in Auditing
Due to massive and rapid technological advances, companies are continuously developing technologies to improve their business strategy and day-to-day operations. Among these technologies, data analytics has gained popularity across a wide range of organisations, from corporate and government to scientific and academic disciplines, including accounting and auditing (Dagilienė and Klovienė, 2019). According to the Institute of Chartered Accountants in England and Wales (ICAEW), it is vital for the audit profession to keep up with these changes and be proactive in examining how new technological trends may affect auditing methods (Joshi and Marthandan, 2018). Therefore, accounting professionals around the world need to adapt to these technological disruptions.

Audit data analytics is believed to have a greater depth of capabilities and a broader scope than standard analytical methods because it involves powerful software tools and statistically demanding methodologies (Joshi and Marthandan, 2018). Auditors need to adopt data analytics skills and big data technologies to execute their operational functions efficiently. Both internal and external auditors use data analytics to perform audit functions such as continuous monitoring, continuous auditing, and full-set analysis when sample audits fail to produce good-quality results (Vasarhelyi et al., 2010; Protiviti, 2020). Data analytics helps auditors to extract useful insights from large volumes of data in real time, allowing them to make evidence-based decisions. In addition, data analytics may assist auditors in the following areas:
1. Providing audit evidence by analysing the general ledger systems of corporations in depth (Malaescu and Sutton, 2015).
2. Detecting fraud and improving other aspects of forensic accounting (Joshi and Marthandan, 2018).
3. Assisting in the detection of anomalies and trends, as well as the comparison of industry data in risk assessment (Wang and Cuthbertson, 2015).
4. Providing services and resolving issues for clients beyond auditors' current capabilities by incorporating external data (Earley, 2015).
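A simple and widely cited illustration of the fraud- and anomaly-detection uses listed above is a first-digit (Benford's law) screen over transaction amounts. The sketch below, built on made-up figures, compares observed first-digit frequencies with Benford's expected distribution; the amounts and the 10-percentage-point flagging threshold are assumptions for illustration, and such a screen is only an indicator for further testing, not an audit procedure prescribed by the chapter.

    # Benford's-law screen: compare first-digit frequencies of transaction amounts
    # with the expected Benford distribution. Amounts below are hypothetical.
    import math
    from collections import Counter

    amounts = [1230.50, 1780.00, 942.10, 310.75, 118.20, 265.00,
               1995.40, 402.33, 87.60, 1410.00, 220.10, 165.90]

    first_digits = [int(str(a).lstrip("0.")[0]) for a in amounts]
    observed = Counter(first_digits)
    n = len(amounts)

    for d in range(1, 10):
        expected = math.log10(1 + 1 / d)       # Benford's expected proportion
        actual = observed.get(d, 0) / n
        flag = "  <- check" if abs(actual - expected) > 0.10 else ""
        print(f"digit {d}: expected {expected:.2f}, observed {actual:.2f}{flag}")

On a realistic population of thousands of transactions, digits whose observed frequency departs materially from the expected proportion would simply mark areas where the auditor drills down with substantive procedures.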
The AICPA and Rutgers Business School announced a research initiative in December 2015 focused on the advanced use of data analytics in auditing (Tschakert et al., 2016). The initiative's aim was to develop a better understanding of how data analytics can be integrated into the audit process to improve the quality of auditing work (Tschakert et al., 2016). Potential advances in data analytics include higher-quality audit evidence, minimising repetitive tasks, and better correlation of audit tasks with risks and assertions (Tschakert et al., 2016).

Both the AICPA Assurance Services Executive Committee and the Auditing Standards Board Task Force are working on a revised audit data analytics guideline that will be more acceptable and workable than the existing analytical procedure guideline. The new guideline will adopt most of the existing audit analytical procedure guidelines but will also introduce separate guidance on how audit data analytics can be integrated into the overall audit process (Tschakert et al., 2016). Another project is also underway to develop voluntary audit data standards that are likely to support the extraction of data and expedite the utilisation of audit data analytics, together with a mechanism to illustrate where audit data analytics might be implemented in a typical audit program (Tschakert et al., 2016).
3.6 Data Analytics in Financial Accounting
Implementing data analytics can help a company to increase its profit margins and gain a competitive advantage. For example, according to a recent survey by software vendor Sage (2018) of 3000 accounting professionals globally, 56 per cent of accountants believe their practice revenue has risen over the last 12 months due to the adoption of automation. Companies with only a limited application of data analytics in their business operations may be forced out of business in the long run. Data analytics is also an area in which technological transformation can happen faster than an organisation and its senior-level executives can adapt; thus, change management concepts may need to be applied to take full advantage of data analytics.

The measurement of accounting information has increasingly lost its informational value due to the significant decline in the explanation of market value provided by accounting variables (Lev and Zarowin, 1999). This decline in information value is particularly evident for emerging knowledge-intensive businesses with a higher intangible intensity, which claim a steadily expanding share of a country's economy (Srivastava, 2014). The economy increasingly values data processed in real time rather than performance measured only through quarterly or annual financial statements (The Economist, 2002; Vasarhelyi and Greenstein, 2003), and such information is considered the most useful for external capital providers and capital market participants to understand and interpret so that effective decisions can be made (Krahel and Titera, 2015).

Current accounting recording technologies provide more absolute measurements of intangibles, inventory valuations (for example, LIFO, FIFO) and depreciation estimates than traditional accounting and reporting systems (Lev and Zarowin, 1999; Lev, 2000). For example, most students feel unable to learn several old-fashioned inventory accounting methods, while firms are now integrating radio-frequency identification (RFID) or barcodes to measure and report actual inventory (Vasarhelyi et al., 2015). Similarly, real-time financial statements reveal more accurate and timely business-related information than traditional accounting and reporting systems, helping internal and external users to make efficient and convincing investment decisions (Gal, 2008).
This technology-enabled reporting provides more relevant and descriptive disclosure, supporting the analysis and provisioning of management, auditor, and stakeholder dashboards. Thus, traditional accounting methods of reporting information have lost much of their appeal to information users such as external capital providers and capital market participants. Businesses have added a large amount of data to their traditional data repositories, resulting in massive databases in their ERP systems, of which only a small portion is relevant for financial reporting. Although these "structured data stores" in ERPs are large, they could be overwhelmed by the increasing amounts of less structured data.
3.7 Data Analytics in Management Accounting
Data analytics in management accounting can be defined as the way in which information technology tools are implemented to analyse and interpret a company's managerial activities (Spraakman et al., 2020). The traditional role of management accountants has focused on participating in management decision-making, designing planning and performance management systems, budgetary control, product profitability, and assisting senior management in formulating and implementing organisational strategies (Association of Chartered Certified Accountants (ACCA), 2020). However, as big data and data analytics become more prevalent, their current practices, working data and tools, interactions with management and other departments, and requests for new areas of competence are all likely to expand (Nielsen, 2018; Tiron-Tudor et al., 2021) to include channel profitability, predictive accounting, and business analytics (Appelbaum et al., 2017).

Data analytics is viewed as a critical skill for any accounting professional. Management accountants are uniquely prepared to assess an organisation's data requirements because they have a comprehensive understanding of the organisation and its existing information systems. This helps them to delve into both financial and non-financial data to drive better decision-making (ACCA, 2020; Tiron-Tudor et al., 2021). Therefore, management accountants must comprehend data analytics and be able to convey their findings to upper management, as well as have the business understanding and commercial acumen to assess data analytics results, provide significant commercial analysis and supply suggestions. Management accountants also use data analytics to add value by improving productivity, profitability, and cash flow, as well as managing customers, innovation, and intellectual property, which opens up new perspectives and ensures business organisations' long-term viability.

From the academic perspective, although researchers have noted that prior research on data analytics in management accounting lacks sufficient empirical evidence (Cokins, 2014; Dinan, 2015; Lin, 2016), several conceptual papers have explored the implications of data analytics in management accounting (Schneider et al., 2015; Appelbaum et al., 2017). Vasarhelyi et al. (2015) illustrate how accounting evolved from paper-based aggregate information records through charts of accounts and general ledgers to big data and data analytics. Consistent with this notion, Pickard and Cokins (2015) argue that accountants are in the best position in a business to own and drive a large portfolio of data analytics to support effective business decisions. They also report that accountants are likely to incorporate data analytics ranging from general financial ratios to more sophisticated techniques such as clustering, regression, and factor analysis. Similarly, Schläfke et al. (2013) document that accountants have prior knowledge of financial reporting and can therefore implement various contemporary data analytics tools to prepare several financial and non-financial reports.
Big data, data analytics and artificial intelligence in accounting 45 several financial and non-financial reports. Moreover, Marr (2016) reports that 45 companies throughout the world are using big data in management accounting, such as Walmart, Netflix, Amazon, and Airbnb. Nielsen (2018) argued that management accounting research should capitalise on the opportunity provided by the data analytics movement to develop theories and ideas for fact-based judgements with high external validity. Similarly, Arnaboldi et al. (2017) look into the interaction between technology-enabled networks and the accounting function to spur more research and debate on the topic.
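To make the kinds of techniques Pickard and Cokins (2015) mention concrete, the sketch below clusters customers by revenue and cost-to-serve to surface profitability segments. It is a minimal, hypothetical illustration: the customer figures, the choice of fields and the use of scikit-learn's KMeans are assumptions for demonstration, not a method prescribed in the literature cited above.

```python
# Hypothetical sketch: clustering customers by profitability drivers.
# Field names and figures are illustrative assumptions, not from the chapter.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Each row: [annual revenue, cost to serve] for one customer (in thousands)
customers = np.array([
    [120.0, 15.0],
    [ 95.0, 40.0],
    [300.0, 22.0],
    [ 18.0, 12.0],
    [250.0, 90.0],
    [ 60.0,  8.0],
])

# Standardise so both drivers carry equal weight in the distance metric
scaled = StandardScaler().fit_transform(customers)

# Group customers into three profitability segments
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(scaled)

for row, label in zip(customers, kmeans.labels_):
    margin = row[0] - row[1]
    print(f"revenue={row[0]:>6.0f}k  cost={row[1]:>5.0f}k  "
          f"margin={margin:>6.0f}k  segment={label}")
```

In practice the segments would be interpreted alongside the regression and factor-analysis techniques mentioned above rather than read off mechanically.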
4.
ARTIFICIAL INTELLIGENCE (AI) IN ACCOUNTING
AI is the simulation of human intelligence in machines. It allows machines to think, learn and solve problems in much the same way that human brains do. The use of AI enables machines to perform the necessary tasks by mimicking the behaviour of human intelligence. Several companies worldwide have implemented AI in their accounting functions and analysis in order to obtain its benefits. For example, according to a recent survey of 3000 accounting professionals globally conducted by software vendor Sage (2018), 66 per cent of accountants believe they will invest in AI to automate repetitive and time-consuming tasks, while 55 per cent assert they will use AI to improve their business operations. AI was first introduced into accounting more than 30 years ago (Abdolmohammadi, 1987; Brown, 1989). Specifically, AI was employed in financial accounting and auditing in the late 1980s and early 1990s (Barniv et al., 1997; Etheridge and Sriram, 1997). After this period, significant advances were made in other areas of accounting and finance. Companies throughout the world are reaping enormous benefits by integrating AI into accounting tasks, and these benefits can be classified as internal or external. For internal purposes, AI is used in accounting functions to produce more accurate and acceptable financial statements. AI can offer information faster than humans due to its competency and consistency in analysing and interpreting accounting data (Petkov, 2020). As a result, the accounting functions performed by AI can provide quick and accurate output. This instant output improves the timeliness of accounting information and helps users make informed decisions. AI that has been well trained to attain accuracy, that is, that has been programmed to follow accounting rules, will produce more accurate and consistent accounting information. Consistent with this notion, incorporating AI in accounting functions can eliminate accounting and human errors when preparing financial statements. Furthermore, several companies throughout the world have adopted AI with predefined "trained principles", and these companies are benefiting from improved financial reporting comparability. Accounting firms are currently integrating AI into auditing functions to ensure compliance and reduce managers' intentional errors, which would limit managers' ability to manipulate certain financial figures. Although just a few accounting firms have embedded AI in their auditing functions, the majority of them use AI to manage audit risk (Zhao et al., 2004). Furthermore, the most notable benefit of incorporating AI into a company's accounting function is the minimisation of future costs. In the long term, reliance on AI will reduce dependence on human operations and improve the efficiency and accuracy of a company's financial reporting. Initially, there are certain fixed costs associated with the design, development and implementation of AI in a company's accounting function, as well as some indirect costs associated with monitoring and confirming AI's performance. Furthermore, another important cost of AI is its reliance on the entire system: if the system is hacked or attacked and no human backup assistance is available, it will become a liability rather than an advantage to the company (Petkov, 2020). For this reason, proper maintenance of the AI system is an important function for a company to establish before implementing AI. Moreover, Petkov (2020) identifies the potential accounting functions (as shown in Table 3.3) that can be delegated to AI.
Table 3.3  Potential accounting functions to delegate to artificial intelligence (AI)

Cash
Human functions:
● Manual input of cash receipts and payments [use of journal entries (hereafter, J/E)].
● Bank reconciliation performed by individuals reconciling outstanding cheques, deposits, errors, interest, etc.
Artificial intelligence (AI) functions:
● To scan cash payments/receipts into the general ledger (G/L), similarly to how it is done for a bank deposit/withdrawal (regardless of their nature).
● To train AI to perform this reconciliation by analysing reconciling inputs and generating a bank reconciliation report for review by humans.

Accounts receivable (A/R)
Human functions:
● J/E prepared based on contractual obligation (be it oral or verbal, followed by invoice).
● J/E for collection based on receipt of payment.
● J/E for allowance for doubtful accounts, based on estimations and assumptions.
Artificial intelligence (AI) functions:
● These tasks could be delegated to AI. Specifically, the receipt of cash payments via wire transfers or cheques at the point of scanning could result in J/E in the system (similar to bank deposits/withdrawals).

Inventory
Human functions:
● J/E for purchases and sales.
● J/E for lower of cost or market (LCM) value, obsolete inventory, etc. (based on historical data).
Artificial intelligence (AI) functions:
● Delegate to AI capable of identifying movement of inventory (ins and outs) and preparing automatic J/Es.
● Delegate the estimation of LCM to AI by providing inputs – costs (directly from the G/L) and market values (from a standard tool sheet capturing market values of inventory from third parties).

Prepaid
Human functions:
● J/E to record initial asset.
● J/E to record period-end expense based on use.
Artificial intelligence (AI) functions:
● Delegate to AI by training it to scan bank statements and identify such transactions. Humans could continue to be involved to determine duration and make periodic, timely adjustments.

Investments
Human functions:
● J/E for initial recording.
● J/E adjustments based on cost or equity method chosen.
Artificial intelligence (AI) functions:
● AI to scan bank statements, identify such purchases, and record J/Es.
● To train AI to analyse the F/S of invested companies, seek the relevant activity – such as NI and dividends – and prepare J/Es automatically.

Property, plant, and equipment (PPE)
Human functions:
● J/E to record PPE purchases, or disposals if any.
● J/E for depreciation expense, already done by AI.
Artificial intelligence (AI) functions:
● AI to scan bank statements and identify transactions related to PPE purchases and disposals.

Intangibles
Human functions:
● J/E to record intangible purchases, or disposals if any.
● J/E for amortization expense, already done by AI.
● J/E for goodwill impairment.
Artificial intelligence (AI) functions:
● AI to scan bank statements and identify transactions related to intangible purchases and disposals.
● Train AI to perform impairment testing by providing key inputs from other departments.

Accounts payable (A/P)
Human functions:
● J/E prepared based on contractual obligation (be it oral or verbal, followed by receipt of invoice from vendor).
● J/E for payment to vendor.
Artificial intelligence (AI) functions:
● These tasks could be delegated to AI. Specifically, the payment of cash via wire transfers or cheques at the point of scanning could result in J/E in the system (similar to bank deposits/withdrawals).

Accrued expenses
Human functions:
● J/E prepared based on assumptions and historical data.
Artificial intelligence (AI) functions:
● Train AI to analyse such data and make on-demand J/Es based on this data.

Unearned revenue
Human functions:
● J/E to record initial liability.
● J/E to recognize revenue based on use.
Artificial intelligence (AI) functions:
● Delegate to AI by training it to analyse budgets and tie the budgets to actual revenue orders and their performance.

Notes payable (N/P)
Human functions:
● J/E to record assumption and repayment of N/P.
● J/E for interest payment.
Artificial intelligence (AI) functions:
● To teach AI to scan bank statements and identify such transactions.
● J/E for interest payment should be based on the contract and therefore could be delegated.

Revenues
Human functions: refer to A/R and Inventory.
Artificial intelligence (AI) functions: refer to A/R and Inventory.

Expenses
Human functions: refer to A/P and Inventory.
Artificial intelligence (AI) functions: refer to A/P and Inventory.

Source: Adapted from Petkov (2020).
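Several rows of Table 3.3 describe AI that scans bank-statement lines and posts the corresponding journal entries for human review. The following is a minimal, hypothetical sketch of that idea in Python: the keyword rules, account names and transactions are invented for illustration, and a real system would rely on trained classifiers and on the human review the table describes rather than a handful of hard-coded rules.

```python
# Hypothetical sketch: turning scanned bank-statement lines into draft journal
# entries, in the spirit of the "scan and post J/E" functions in Table 3.3.
# Keyword rules, account names and amounts are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class JournalEntry:
    debit_account: str
    credit_account: str
    amount: float
    memo: str

# Very simple keyword rules standing in for a trained classifier
RULES = [
    ("RENT",      ("Rent expense",                 "Cash")),
    ("PAYROLL",   ("Salaries expense",             "Cash")),
    ("DEPOSIT",   ("Cash",                         "Accounts receivable")),
    ("EQUIPMENT", ("Property, plant & equipment",  "Cash")),
]

def draft_entry(description: str, amount: float) -> JournalEntry:
    """Match a bank line to a rule and draft a J/E for human review."""
    for keyword, (debit, credit) in RULES:
        if keyword in description.upper():
            return JournalEntry(debit, credit, amount, description)
    # Unmatched lines go to a suspense account for a person to resolve
    return JournalEntry("Suspense", "Cash", amount, description)

bank_lines = [("Monthly office RENT", 2500.0),
              ("Customer DEPOSIT inv 1043", 1800.0),
              ("New EQUIPMENT purchase", 7200.0)]

for desc, amt in bank_lines:
    print(draft_entry(desc, amt))
```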
5. CONCLUSION
This chapter aims to provide an overview of the growing role played by big data, data analytics, and AI in the accounting profession. Big data, data analytics, and AI have become emerging buzzwords for accounting professionals in recent years. It is becoming increasingly important for accounting professionals to understand the possibilities that big data and data analytics offer to their clients and the industry. Accounting is data-driven, and consequently big data can assist accounting professionals in delivering greater value to their clients. Accounting professionals' job responsibilities are rapidly changing due to the development of big data, data analytics, and AI. It is argued that these emerging technologies pose a threat to accounting professionals, and experts warn that anyone who does not embrace this change and appreciate the new technologies will be left behind (CIMA, 2022). However, accounting professionals have long-standing reputations in the market for their outstanding technical and problem-solving skills as well as their ability to make data-driven decisions. In fact, a recent survey of accounting professionals conducted by software vendor Sage (2018) reports that 83 per cent of clients now have higher expectations of their accountants than they did five years ago. In the same survey of 3000 accountants globally, Sage (2018) also reports that 39 per cent of accountants describe themselves as early adopters of technology. Individuals who are able to discern patterns in data and translate them into compelling strategic narratives will find themselves at the centre of the twenty-first-century business world (ACCA and IMA, 2013b). Therefore, in a rapidly changing business environment, accounting professionals can seize numerous opportunities by leveraging big data, data analytics, and AI to stay ahead of the competition.
REFERENCES
Abad-Segura, E., Infante-Moro, A., González-Zamar, M.D. and López-Meneses, E. (2021). Blockchain technology for secure accounting management: Research trends analysis. Mathematics, 9(14), 1631. Abdolmohammadi, M.J. (1987). Decision support and expert systems in auditing: A review and research directions. Accounting and Business Research, 17(66), 173–85. Appelbaum, D., Kogan, A., Vasarhelyi, M. and Yan, Z. (2017). Impact of business analytics and enterprise systems on managerial accounting. International Journal of Accounting Information Systems, 25, 29–44.
48 Handbook of big data research methods Arnaboldi, M., Busco, C. and Cuganesan, S. (2017). Accounting, accountability, social media and big data: Revolution or hype? Accounting, Auditing & Accountability Journal, 30(4), 762–76. Association of Chartered Certified Accountants (ACCA) and the Institute of Management Accountants (IMA) (2013a). Digital Darwinism: Thriving in the face of technology change. Retrieved from https:// www.accaglobal.com/my/en/technical-activities/technical-resources-search/2013/october/digital -darwinism.html (accessed on 10 March 2022). Association of Chartered Certified Accountants (ACCA) and Institute of Management Accountants (IMA) (2013b). Big data: Its power and perils. Retrieved from https://www.accaglobal.com/gb/ en/technical-activities/technical-resources-search/2013/december/big-data-its-power-and-perils.html (accessed on 9 March 2022). Association of Chartered Certified Accountants (ACCA) (2020). Data analytics and the role of the management accountant. Retrieved from https://www.accaglobal.com/hk/en/student/exam-support -resources/professional-exams-study-resources/p5/technical-articles/data-analytics.html (accessed on 9 March 2022). Association to Advance Collegiate Schools of Business (AACSB) (2013). Eligibility procedures and accreditation standards for accounting accreditation. Retrieved from http://www. aacsb.edu/ accreditation/accounting/standards/2013/(accessed on 9 March 2022). Austin, A.A., Carpenter, T.D., Christ, M.H. and Nielson, C.S. (2021). The data analytics journey: Interactions among auditors, managers, regulation, and technology. Contemporary Accounting Research, 38(3), 1888–924. Barniv, R., Agarwal, A. and Leach, R. (1997). Predicting the outcome following bankruptcy filing: A three‐state classification using neural networks. Intelligent Systems in Accounting, Finance & Management, 6(3), 177–94. Behn, B., Ezzell, W.F., Murphy, L.A., Rayburn, J.D., Stith, M.T. and Strawser, J.R. (2012). Pathways to a profession: Charting a national strategy for the next generation of accountants. Commission on Accounting Higher Education. Issues in Accounting Education, 27(3), 595–600. Bertolucci, J. (2013). Big Data: A practical definition. InformationWeek (26 August). Retrieved from http://www.informationweek.com/big-data/big-data-analytics/big-data-a-practical-definition/d/d-id/ 1111290? (accessed on 9 March 2022). Boomer, J. (2018). The value of big data in an accounting firm. Retrieved from https:// www .cpapracticeadvisor.com/firm-management/article/12424744/the-value-of-big-data-in-an-accounting -firm (accessed on 5 March 2022). Brown, C.E. (1989). Accounting expert systems: A comprehensive, annotated bibliography. Expert Systems Review, 2(1–2), 23–129. Brown-Liburd, H., Issa, H. and Lombardi, D. (2015). Behavioral implications of Big Data's impact on audit judgment and decision making and future research directions. Accounting Horizons, 29(2), 451–68. Cao, M., Chychyla, R. and Stewart, T. (2015). Big data analytics in financial statement audits. Accounting Horizons, 29(2), 423–9. Capriotti, R. (2014). Big Data bringing big changes to accounting. Pennsylvania CPA Journal, 85(2), 36–8. Casselman, B. (2015). Big government is getting in the way of big data. Retrieved from https:// fivethirtyeight.com/features/big-government-is-getting-in-the-way-of-big-data/ (accessed on 12 March 2022). Chartered Global Management Accountant (CGMA) (2013). From insight to impact: Unlocking opportunities in Big Data. 
Retrieved from https://www.cgma.org/resources/reports/insight-to-impact-big -data.html (accessed on 12 March 2022). Chartered Institute of Management Accountants (CIMA) (2022). What Big Data and AI mean for the Finance Professional. Retrieved from https://www.cimaglobal.com/CGMA-Store/Finance-Futurist -Blogs/Blog-What-Big-Data-and-AI-mean-for-the-Finance-Professional/ (accessed on 10 March 2022). Clayton & McKervey (2020). Managing cash flow through data analytics. Retrieved from https://www .primeglobal.net/news/managing-cash-flow-through-data-analytics-clayton-mckervey (accessed on 10 March 2022). Cokins, G. (2014). Top 7 trends in management accounting. Strategic Finance, 95(7), 41–8.
Big data, data analytics and artificial intelligence in accounting 49 Coyne, J.G. and McMickle, P.L. (2017). Can blockchains serve an accounting purpose? Journal of Emerging Technologies in Accounting, 14(2), 101–11. DaBruzzo, R., Dannenfelser, T. and DeRocco, D. (2013). Tax portals and dynamic data analytics: The new view for management and control. Retrieved from http://www.ebookbrowsee.net/tax-portals -anddynamic-data-analytics-pdf-d472497864 (accessed on 12 March 2022). Dagilienė, L. and Klovienė, L. (2019). Motivation to use big data and big data analytics in external auditing. Managerial Auditing Journal, 34(7), 750–82. Dai, J. and Vasarhelyi, M.A. (2017). Toward blockchain-based accounting and assurance. Journal of Information Systems, 31(3), 5–21. Dai, J., Byrnes, P., Liu, Q. and Vasarhelyi, M. (2019). Audit analytics: A field study of credit card after-sale service problem detection at a major bank. In Audit Analytics in the Financial Industry (pp. 17–33). Rutgers Studies in Accounting Analytics. Bingley: Emerald Publishing. Davenport, T. and Harris, J. (2017). Competing on Analytics: Updated, With a New Introduction: The New Science of Winning. Cambridge, MA: Harvard Business Press. Dinan, T. (2015). Predictive analytics can move you from scorekeeper to proactive manager. Pennsylvania CPA Journal, 86, 9–10. Earley, C.E. (2015). Data analytics in auditing: Opportunities and challenges. Business Horizons, 58(5), 493–500. Ernst & Young (2014). Big data: Changing the way businesses compete and operate. Retrieved from https://motamem.org/wp-content/uploads/2019/01/Big-data-applications-and-insights.pdf (accessed on 11 March 2022). Ernst & Young (2020). How to harness artificial intelligence in accounting. Retrieved from https://www .ey.com/en_sg/ai/how-to-harness-artificial-intelligence-in-accounting (accessed on 10 March 2022). Etheridge, H.L. and Sriram, R.S. (1997). A comparison of the relative costs of financial distress models: Artificial neural networks, logit and multivariate discriminant analysis. Intelligent Systems in Accounting, Finance & Management, 6(3), 235–48. Gal, G. (2008). Query issues in continuous reporting systems. Journal of Emerging Technologies in Accounting, 5(1), 81–97. Gantz, J. and Reinsel, D. (2012). The digital universe in 2020: Big data, bigger digital shadows, and biggest growth in the Far East. Retrieved from https://www.cs.princeton.edu/courses/archive/ spring13/cos598C/idc-the-digital-universe-in-2020.pdf (accessed on 10 March 2022). Gartner (2012). The importance of “Big Data”: A definition. Retrieved from http://www.gartner.com (accessed on 11 March 2022). Haverson, A. (2014). Why predictive analytics should be “A CPA Thing”. New York: IMTA Business Intelligence Task Force. Institute of Management Accountants (IMA) (2019). The impact of big data on finance: Now and in the future. Retrieved from https://www.imanet.org/insights-and-trends/technology-enablement/the -impact-of-big-data-on-finance-now-and-in-the-future?ssopc=1 (accessed on 11 March 2022). Janvrin, D.J. and Watson, M.W. (2017). “Big Data”: A new twist to accounting. Journal of Accounting Education, 38, 3–8. Joshi, P.L. and Marthandan, G. (2018). The hype of big data analytics and auditors. Emerging Markets Journal, 8(2), 1–4. KPMG (2016). Leveraging data analytics and continuous auditing processes for improved audit planning, effectiveness, and efficiency. Retrieved from https://assets.kpmg/content/dam/kpmg/pdf/2016/ 05/Leveraging-Data-Analytics.pdf (accessed on 11 March 2022). 
Krahel, J. and Titera, B. (2015). How standards will/should change with Big Data. Accounting Horizons, 29(2), 409–22. Lev, B. (2000). Intangibles: Management, Measurement, and Reporting. Washington, DC: Brookings Institution Press. Lev, B. and Zarowin, P. (1999). The boundaries of financial reporting and how to extend them. Journal of Accounting Research, 37(2), 353–85. Lin, P. (2016). What CPAs need to know about mobile business analytics. CPA Journal, 86(5), 39–41. Liu, Q. and Vasarhelyi, M.A. (2014). Big questions in AIS research: Measurement, information processing, data analysis, and reporting. Journal of Information Systems, 28(1), 1–17.
50 Handbook of big data research methods Löffler, M. and Tschiesner, A. (2013). The Internet of Things and the future of manufacturing. McKinsey & Company. Retrieved from https://www.mckinsey.com/business-functions/mckinsey-digital/our -insights/the-internet-of-things-and-the-future-of-manufacturing (accessed on 12 March 2022). Malaescu, I. and Sutton, S.G. (2015). The reliance of external auditors on internal audit’s use of continuous audit. Journal of Information Systems, 29(1), 95–114. Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C. and Hung Byers, A. (2011). Big data: The next frontier for innovation, competition, and productivity: McKinsey Global Institute. Retrieved from https://www.mckinsey.com/business-functions/mckinsey-digital/our-insights/big -data-the-next-frontier-for-innovation (accessed on 12 March 202). Marr, B. (2016). Big Data in Practice: How 45 Successful Companies Used Big Data Analytics to deliver extraordinary results. Chichester: John Wiley & Sons. McKinsey Global Institute (2012). Minding your digital business, McKinsey Global Survey results. Retrieved from https://www.mckinsey.com/business-functions/mckinsey-digital/our-insights/ minding-your-digital-business-mckinsey-global-survey-results (accessed on 11 March 2022). Merritt-Holmes, M. (2016). Big Data & analytics: The DNA to a successful implementation in 2016. Retrieved from https://dataanalytics.report/Resources/Whitepapers/40de07ff-696d-40cd-b05d -ae38f26429c3_w2.pdf (accessed on 13 March 2022). Mikalef, P., Pateli, A., Batenburg, R.S. and Wetering, R.V.D. (2015). Purchasing alignment under multiple contingencies: A configuration theory approach. Industrial Management & Data Systems, 115(4), 625–45. Mills, S., Lucas, S., Irakliotis, L., Rappa, M., Carlson, T. and Perlowitz, B. (2012). Demystifying big data: A practical guide to transforming the business of government. TechAmerica Foundation, Washington. Retrieved from https://bigdatawg.nist.gov/_uploadfiles/M0068_v1_3903747095.pdf (accessed on 12 March 2022). Nielsen, S. (2018). Reflections on the applicability of business analytics for management accounting – and future perspectives for the accountant. Journal of Accounting & Organizational Change, 14(2), 167–87. Office of Management and Budget (2015). Fiscal Year 2016 Budget of the U.S. Government. Washington, DC: U.S. Government Printing Office. Retrieved from https://www.govinfo.gov/ content/pkg/BUDGET-2016-BUD/pdf/BUDGET-2016-BUD.pdf (accessed on 13 March 2022). Petkov, R. (2020). Artificial intelligence (AI) and the accounting function: A revisit and a new perspective for developing framework. Journal of Emerging Technologies in Accounting, 17(1), 99–105. Pickard, M.D. and Cokins, G. (2015). From bean counters to bean growers: Accountants as data analysts – A customer profitability example. Journal of Information Systems, 29(3), 151–64. Poddar, A. (2021). Data analytics in accounting: 5 comprehensive aspects. Retrieved from https:// hevodata.com/learn/data-analytics-in-accounting/ (accessed on 10 January 2022). Protiviti (2020). Protiviti’s 2020 internal audit capabilities and needs survey. Retrieved from https:// www.protiviti.com/AU-en/insights/internal-audit-capabilities-and-needs-survey (accessed on 10 March 2022). Romero, S., Gal, G., Mock, T.J. and Vasarhelyi, M.A. (2012). A measurement theory perspective on business measurement. Journal of Emerging Technologies in Accounting, 9(1), 1–24. Sage (2018). 
Accountants’ adoption of Artificial Intelligence expected to increase as clients’ expectations shift. Retrieved from https://www.sage.com/en-us/news/press-releases/2018/03/accountants -adoption-of-ai-expected-to-increase/ (accessed on 9 March 2022). Schläfke, M., Silvi, R. and Möller, K. (2013). A framework for business analytics in performance management. International Journal of Productivity and Performance Management, 62(1), 110–22. Schmidt, P.J., Riley, J. and Church, K.S. (2020). Investigating accountants’ resistance to move beyond Excel and adopt new data analytics technology. Accounting Horizons, 34(4), 165–80. Schneider, G.P., Dai, J., Janvrin, D.J., Ajayi, K. and Raschke, R.L. (2015). Infer, predict, and assure: Accounting opportunities in data analytics. Accounting Horizons, 29(3), 719–42. See, A.V. (2021). Amount of data created, consumed, and stored 2010–2025. Retrieved from https:// www.statista.com/statistics/871513/worldwide-data-created/ (accessed on 10 March 2022). Siegel, E. (2013). Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die. Hoboken, NJ: John Wiley & Sons.
Big data, data analytics and artificial intelligence in accounting 51 Spraakman, G., Sanchez-Rodriguez, C. and Tuck-Riggs, C.A. (2020). Data analytics by management accountants. Qualitative Research in Accounting & Management, 18(1), 127–47. Srinivasan, U. and Arunasalam, B. (2013). Leveraging big data analytics to reduce healthcare costs. IT Professional, 15(6), 21–8. Srivastava, A. (2014). Why have measures of earnings quality changed over time? Journal of Accounting and Economics, 57(2–3), 196–217. Sun, T. and Vasarhelyi, M.A. (2018). Embracing textual data analytics in auditing with deep learning. The International Journal of Digital Accounting Research, 18, 49–67. The Economist (2002). The real-time economy: How about now? Retrieved from http://www.economist. com/node/949071(accessed on 10 March 2022). Tiron-Tudor, A., Deliu, D., Farcane, N. and Dontu, A. (2021). Managing change with and through blockchain in accountancy organizations: A systematic literature review. Journal of Organizational Change Management, 34(2), 477–506. Trkman, P., McCormack, K., De Oliveira, M.P.V. and Ladeira, M.B. (2010). The impact of business analytics on supply chain performance. Decision Support Systems, 49(3), 318–27. Tschakert, N., Kokina, J., Kozlowski, S. and Vasarhelyi, M. (2016). The next frontier in data analytics. Journal of Accountancy, 222(2), 58–63. Vasarhelyi, M. (2013). The emerging role of audit analytics: Internal audit should embrace data analytics. Retrieved from http://raw.rutgers.edu/node/89.html (accessed on 12 March 2022). Vasarhelyi, M. and Greenstein, M. (2003). Underlying principles of the electronization of business: A research agenda. International Journal of Accounting Information Systems, 4(1), 1–25. Vasarhelyi, M.A., Alles, M. and Williams, K.T. (2010). Continuous Assurance for the Now Economy. Sydney: Institute of Chartered Accountants in Australia. Vasarhelyi, M.A., Kogan, A. and Tuttle, B.M. (2015). Big data in accounting: An overview. Accounting Horizons, 29(2), 381–96. Verver, J. and Grimm, S. (2013). Integrating analytics into audit risk and compliance. Paper presented at the 28th World Continuous Auditing and Reporting Symposium, Newark, NJ. Wang, T. and Cuthbertson, R. (2015). Eight issues on audit data analytics we would like researched. Journal of Information Systems, 29(1), 155–62. Wixom, B.H., Yen, B. and Relich, M. (2013). Maximizing value from business analytics. MIS Quarterly Executive, 12(2), 111–23. Zhang, J., Yang, X. and Appelbaum, D. (2015). Toward effective big data analysis in continuous auditing. Accounting Horizons, 29(2), 469–76. Zhao, N., Yen, D.C. and Chang, I.C. (2004). Auditing in the e‐commerce era. Information Management & Computer Security, 12(5), 389–400.
4. The benefits of marketing analytics and challenges
Madiha Farooqui
1. INTRODUCTION
In the contemporary world, the most successful businesses are those that use marketing tools to review and appraise data. The importance of a solid understanding of marketing analytics cannot be overstated. What exactly is meant by "marketing analytics"? The concept refers to the techniques and approaches that give the marketer the ability to understand patterns of performance across many different communication channels. The marketer can then assess the yield, the influence of marketing techniques, and the expediency of those methods. In a nutshell, marketing analytics helps the marketer understand performance on the basis of many methodologies, of which experimenting with different outputs based on different plotting tactics is only one. The Mailchimp Marketing Glossary adopts a similar view, describing marketing analytics as "a math-based discipline that aims to uncover a pattern in data to boost practical information." In order to solve problems and provide answers, analytics makes use of methods such as statistical analysis, predictive modelling, and machine learning. The results of using analytics may be seen in anything from weather forecasts, to cricket batting averages, to life insurance plans. When it comes to digital marketing, analytics plays a crucial role in gaining knowledge of and making accurate predictions about user behaviour, as well as in maximizing the user experience (UX) in order to boost sales. Fuelled by an increase in commercial web usage by governments and individuals, big data has risen as a scientific and marketing field that collects, analyses and extracts informational and imperative value from massive amounts of corporate and customer online contact (Wesley Chai, n.d.). According to Jonathan Shaw (2014), the amount of publicly available data sources and amorphous data is likewise very large and is increasing at an alarming pace, and Shaw argues for the value of such large volumes of data. Data is now pouring in from every conceivable source, including mobile devices such as cellphones, lines of credit, television sets, and processors; urban infrastructure, such as RFID-equipped skyscrapers; trains; automobiles; aircraft; roads; factories; and so on. Data is being generated at such a fast pace that the volume accumulated over the last two years already surpasses that of the whole of prior human history (Shaw, 2014, p. 30). The article "The benefits of big data and using marketing analytics to analyze it" in Forbes Magazine (Whitler, 2015) likewise praises the advantages of big data and the use of marketing analytics to analyse it. Marketing analytics, a term first proposed by Nichols (2013), refers to a collection of capabilities that enable organizations to comprehend data and, as a consequence, measure the effect that marketing initiatives have on their bottom line.
2.
THE IMPORTANCE OF MARKETING ANALYSIS
When trying to determine how successful their campaigns are, marketers make use of information gained from customers. They must evaluate how successful their various methods of monetization are. When it comes to assessing the efficiency of marketing, statistics such as those pertaining to return on investment (ROI) and the impact of commercialization are essential (Marketing Evolution, n.d.). When you use marketing analytics, you collect data from every channel you can measure and combine it into a single company view. From this view, you can extract analytical insights that provide crucial support for your commercial endeavours (SAS, n.d.). In recent years there has been an explosion of new marketing categories, which has led to the development of new technology to facilitate them (Narisetti, 2020). Because each new technology was deployed independently, today's tools are managed almost exclusively in silos, producing a proliferation of disconnected information ecosystems. For this reason, marketing businesses will typically make decisions based on information obtained from a single source (such as the number of visitors to a website) while neglecting the bigger picture of marketing. Data obtained from social networks needs to be supplemented with data obtained from web analytics, and methods that can only look at a single channel in real time are therefore wholly inadequate. By contrast, predictive analysis takes into account all marketing actions over time and across all platforms, which is essential for making smart choices and executing programs efficiently and successfully.
3.
MARKETING ANALYSIS: WHAT ARE THE POSSIBILITIES?
Studying the data from marketing campaigns allows us to answer questions such as the following. In what specific ways might our marketing efforts need improvement? How well are they likely to perform in the long run? Are there steps we can take to make them better? How do our marketing activities compare with those of our rival companies? What channels do they employ that we do not, and why? Where does the majority of the company's time and resources go? Are there other actions we should take to obtain good marketing results? How should we distribute the resources we have available for marketing? Do we direct our time and resources toward the appropriate channels? What are the most crucial investments to make in the next year? These are fundamental questions, and every company needs answers to them. The goal of answering them is to arrive at the most optimal and precise recommendations based on the marketing mix and the models applicable to each market segment. Segmentation is required because one model does not fit all (Klompmaker et al., 1976).
3.1 Marketing Analytics: Three Steps to Success
If you want to get the most out of marketing analytics, these three steps will assist you:
1. Make use of a varied and comprehensive set of analytical methods.
2. Improve any areas of your analytical ability that need improvement.
3. Recognize your shortcomings and adjust your behaviour appropriately.
Because it takes into account facts from the past, the present and the future, a number of writers consider this method to be successful. In addition, Charles G. Jobs (2015) addressed the concept of big data as categorized by Crawford (2012) using the following three factors: (a) technology; (b) analysis; (c) mythology.
Make use of a diverse collection of different analytical methods
You need a well-rounded collection of marketing data, which means you need to figure out how to combine quantitative and qualitative research methods.
Take a look at things from the past: you can get answers to your questions by employing marketing analytics to research the past. For example, which aspect of the advertising brought in the most money in the prior quarter? This establishes the time of year when sales are typically at their highest. Reporting on and analysing historical data lets you answer questions such as which components "x" generated the greatest revenue in the previous quarter, how email campaign "t" performed compared with direct mail campaign "u", and how many leads blog entry "g" generated compared with online channel "h".
Present: marketing analysis gives you the ability to determine how well your marketing initiatives are doing at the moment by answering questions such as: how are our customers engaging with us? Which social media platforms do our most profitable customers favour? Who is talking about our brand via various internet channels and communities, and what are they saying? For instance, popular hashtags with a "#" prefix have been recognized as one of the most influential instruments in social media.
Forecast and anticipate what will happen in the future, and have some say in shaping it: predictions based on data obtained from marketing analytics may be used to influence the future by answering questions such as: how can we turn one-off revenues into long-term commitment and ongoing interaction? How would adding "x" additional sales staff in under-performing regions affect the profitability of the company? With the resources that we now have, which metropolitan regions should we pursue next?
Assess your capabilities as an analyst, and fill in the gaps in the puzzle
Companies in the marketing industry have access to a wide range of analytical capabilities, which may aid them in achieving a number of marketing objectives. If your company is like most others, you probably do not cover all of your bases. The appropriate next step is to conduct an assessment of your current analytic abilities. In the end, it is very important to have a solid understanding of where you are in the analysis so that you can identify any gaps and devise a strategy to fill them. For example, an advertising and marketing firm could already be collecting data from online and EFTPOS transactions. However, what about all the unstructured information that might be gleaned from social media sources or call-centre logs? These kinds of resources are a veritable treasure trove of information, and the time has come when abstract numbers may be transformed into actionable insights that business owners can put to use. Such a firm can then plan and budget for the addition of analytical capabilities that cover that particular need. Start by satisfying the criteria that are most pressing for you at the moment, then work your way outward from there as additional requirements emerge over time.
Put what you've been practising into action
If you don't put into practice the insights that marketing analytics provide, you won't be able to benefit from them in any meaningful way (SAS, n.d.). Marketing analytics enables you to improve the overall performance of your marketing programme by allowing you to identify and classify channel gaps, adjust methodologies and approaches as required, and improve processes. Marketing analytics is a continuous process of testing and learning. If you don't have the capacity to test and assess the performance of your marketing initiatives, you won't know which aspects of the strategy are successful and which aren't, even if you decide to make adjustments. Similarly, if you utilize marketing analytics to gauge performance but don't act on the findings, then the work that you put in is wasted (SAS, n.d.). When used in its entirety, marketing analysis paves the way for stronger, more successful advertising by enabling you to close the loop between your marketing initiatives and your assumptions. For instance, marketing analysis may lead to improved lead maintenance and management, which can result in more revenue and enhanced productivity. It is possible to determine which specific advertising efforts are contributing to your bottom line if you supervise leads in a more thorough manner and have the flexibility to connect deals to those leads. This kind of study is known as a "closed-loop marketing analysis". Today we live in a period in which there is an abundance of information and materials that may be obtained without cost (Spenner, 2012). You may use your mobile phone to get information about your exercise levels, your sleeping habits, and even your medical history while you are at home. At work, you may compile information about your customers by using text files known as cookies or caches, in addition to other similar instruments. You can customize almost any aspect of the information you want, from the kind of products that customers buy to the age groups that are most likely to visit your website. In addition, you may segment these details all the way down to the individual level if you wish. An important piece of research published in the Harvard Business Review by Matthew Dixon (2010) demonstrates that if managers concentrate on reducing the effort customers must make, it immediately reduces the number of customers who leave the company.
According to Dixon (2010), instructing sales representatives to go above and beyond what consumers expect is likely to result in confusion, lost time and effort, and expensive giveaways. In the end, it is not the information itself that is crucial, but rather how you manage and keep track of it. The strength lies in having the information and understanding it. This iterative process is what we call marketing analytics, or marketing research.
The following are the two primary reasons for undertaking marketing analytics:
1. To evaluate how effectively your marketing campaigns are working, by measuring how well your promotional actions are performing.
2. To determine how you might improve the outcomes across all of your marketing channels.
Together, these two processes enable you to translate raw marketing data into a workable strategy and make the most of the money you spend on advertising.
3.2 Why do we Consider that Marketing Analytics is Significant?
Analytics goes beyond being a nice extra. Understanding your customers' journeys and identifying what works and what doesn't are both made possible by customer journey mapping. You'll need this information to optimize your marketing efforts in the future. Here are a few instances of how marketing analytics may assist you with your business.
3.2.1 Make your assertions measurable
The power of numbers cannot be overstated. You may tell your CFO that content attracts consumers, or you may tell them that 72 per cent of marketers feel that content increases customer engagement. The second option offers a greater probability of securing money for your project. When you include relevant numbers in your arguments, people are more likely to pay attention to what you are saying. You can only think in general terms if you don't have specific marketing data points, such as your ROI before and after a campaign. During the time that a certain advertisement was airing, either your revenue climbed or fell. After you began utilizing pay-per-click (PPC) advertising, you either saw an increase in the number of people who signed up for your email list or you saw a decrease. Analytics enables you to examine data from that time period in order to determine how much income a certain campaign produced – in other words, the campaign's marketing impact. How many of the 100 email opt-ins you obtained on the first day of your PPC campaign were generated by clicking on the ad in the first place? Getting money is much easier if you can determine whether or not your marketing strategy was successful. Additionally, you may save money by not continuing with a project if it does not yield positive results. Marketing analytics may be used to demonstrate not only whether or not something is effective, but also why it is effective. And it is via this "why" that you may be able to persuade customers to alter their habits (Williams, 2020).
3.2.2 Transform data into knowledge
Customer data and web analytics technologies are now available to the majority of firms. The difference is whether or not your organization uses that information. In the words of the Harvard Business Review, it typically does nothing more than sit on a server, doing nothing particularly important. When utilized incorrectly, it has the potential to be misinterpreted and misused, which might lead to your marketing team being misled. It is only after you have subjected your data to suitable data analysis that it will be transformed into valuable knowledge. For example, suppose your monthly revenues are around $10 000 while your PPC campaign is just getting started, and after your first campaign your monthly revenues reach $15 000. Should you put more money into that same ad (McCormick, 2022)?
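To make the $10 000-to-$15 000 question above concrete, here is a minimal sketch of the before/after comparison described in this subsection. The spend figure and the return-on-investment arithmetic are illustrative assumptions, not figures from the chapter.

```python
# Hypothetical sketch: comparing revenue before and after a PPC campaign and
# expressing the lift as a simple return on spend. Figures are illustrative.
baseline_monthly_revenue = 10_000.0   # revenue before the campaign
campaign_monthly_revenue = 15_000.0   # revenue while the campaign ran
campaign_spend = 2_000.0              # assumed cost of the PPC campaign

incremental_revenue = campaign_monthly_revenue - baseline_monthly_revenue
roi = (incremental_revenue - campaign_spend) / campaign_spend

print(f"Incremental revenue: ${incremental_revenue:,.0f}")
print(f"Campaign ROI: {roi:.0%}")   # 150% on these illustrative numbers
```

The decision to reinvest would also depend on whether the incremental revenue can really be attributed to the campaign, which is exactly what the attribution models discussed later in the chapter address.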
3.2.3 Compare and contrast the marketing data you've collected
Take it one step further with analytics and compare your data sets to one another for even more insight. Consider the following illustration:
● Was your revenue from sponsored search, social media marketing, and organic search consistent with your predictions, or did it fall short?
● Were there significant differences in the amount of money earned across different demographic groups?
● How did the return on your PPC campaign compare to the return on your Facebook ad campaign?
● What proportion of your PPC campaign's income comes from the first sale as opposed to the total revenue from all sales?
Your advertising campaigns, content initiatives, and customer segments are all intertwined with one another. If you understand the intersections, you can clear out irrelevant information and make the best decisions for your company's specific goals based on the knowledge you have.
3.2.4 Maintain a goal-oriented mind-set
The goal of any marketing asset you create is to increase sales or simply to draw more attention to your company's website. If you don't yet know what that goal is, don't worry: as you assess and utilize the information at your disposal, you will become more knowledgeable about your goals and your progress toward them. Marketing analytics enables you to measure your progress and discover the cause of any problems if things aren't moving as rapidly as you'd like in your business. Consider the following scenario: you ran a Facebook ad campaign with a ROI of around 3:1. Your colleagues tell you to try something else, but you decide to look at the facts first. While your ad attracted a large number of clicks, the rate at which visitors left your webpage without converting was extraordinarily high. The problem did not stem from the PPC campaign. You would never have known this without the use of sophisticated analytics.
3.2.5 Use marketing tools to understand how your company grows
Data is nothing more than a collection of numbers. The benefits accrue when you utilize that data to direct your marketing efforts toward what works and away from what doesn't.
3.2.6 Customer data should be segmented
You can access more specific (and meaningful) data by segmenting your client data based on specific traits or behaviours. You may divide your clients into groups based on any demographic aspect that has an impact on their behaviour. Here are just a few examples:
● Age cohort, ethnicity, academic achievement, annual salary, family circumstances, geospatial position, etc. (for example, Netflix recommendations are simply an algorithm of age, sex, likes, and so on).
● You may also sort data by customers who abandon their shopping carts, peruse your product pages without making a purchase, purchase from you on a regular basis, or haven't visited your website in a lengthy period of time. This segmentation makes it possible to filter your data based on relevancy, keeping only the information you need and discarding the rest.
● Consider a large number of consumers who abandon their shopping carts at the checkout. You run a series of trials to see whether sending them a Facebook message or an email is the most effective approach to get them back to their basket. Your research reveals that sending emails delivers a better return on investment, but only if the language used communicates a feeling of urgency. As a result of this understanding, you will not be tempted to pursue cart abandoners in places where they are not looking.
3.2.7 Make certain you're working with data of high quality
Only high-quality data can be used to do effective analytics. Data from five years ago will not be valuable in your marketing endeavours this year, and if the data is incomplete, it may not be usable at all in your marketing initiatives. For data to be called high quality, it must meet the following criteria: it must be current, complete, devoid of gaps, and error-free, and it should be as precise as the analysis requires and analytically relevant. The last aspect is the most important: marketing analytics should always be goal oriented. If you manage outdated or partial data with care, it is possible to use it in your campaign, but if the data isn't relevant to your campaign's needs, you should leave it out.
3.2.8 Consider the future
The mere knowledge of what has occurred in the past, or even of what is occurring right now, is no longer adequate (although real-time analytics are important). Marketing success also depends on the ability to predict what will happen in the future. However, you will not require a crystal ball to do this; predictive analytics can help you look into the future. These applications make predictions based on particular facts and historical trends about what you could anticipate happening in a variety of situations. Predictive analytics may assist you in answering queries such as:
a. Would increasing the budget for a search engine campaign result in better results?
b. Can a search marketing campaign on one platform be used on another? Would a Facebook ad campaign perform similarly to an email campaign? Is it possible that your Facebook ad would work on Instagram?
c. What type of revenue may you expect from a social media marketing campaign in a new market?
It is not necessary to be a computer scientist to use predictive analytics, but dedicating some time to it can help you determine whether or not what you are doing is effective. Examine both what works well and what doesn't. Even while it's rewarding to concentrate on your strengths, don't stop there: you can learn just as much by paying attention to the areas in which your marketing efforts are falling short of expectations. What in your data might explain a sudden drop in sales? Examine where your content isn't performing well, but remember to stay upbeat and focused on your goals. Treat your weaknesses as opportunities, and use your data analytics to figure out how to bridge the gap. Remember to continue collecting and evaluating data as you make adjustments so that you can notice when things are improving.
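As a hedged illustration of question (a) above, the sketch below fits a simple linear model to past spend-and-revenue pairs and extrapolates to a larger budget. The figures, and the assumption that the relationship stays linear, are illustrative only; a real predictive model would be validated and would account for diminishing returns.

```python
# Hypothetical sketch: a very simple predictive model relating ad spend to
# revenue, used to ask what a larger budget might return. Data is illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression

# Past campaigns: monthly ad spend vs. revenue attributed to the campaign ($)
spend   = np.array([[1000], [2000], [3000], [4000], [5000]])
revenue = np.array([ 4200,   7900,  11800,  15500,  19600])

model = LinearRegression().fit(spend, revenue)

proposed_budget = np.array([[7000]])
forecast = model.predict(proposed_budget)[0]
print(f"Forecast revenue at $7,000 spend: ${forecast:,.0f}")
# A real forecast would also report uncertainty and check for saturation
# effects rather than assuming the linear trend continues indefinitely.
```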
3.2.9 The main points to remember
Any marketing campaign is a series of steps. Marketing analytics can assist you to figure out where you should focus your efforts during the process and what mix is best for your company. Keep the following recommended analytics practices in mind:
● Make a list of the questions that need to be answered;
● Obtain high-quality information;
● Select the information that is relevant to you;
● Think about the past, present, and future.
The answers you get can sometimes leave you with even more questions. You can then repeat the analysing procedure to discover even more about your marketing efforts. In marketing analytics, data collected from marketing operations is analysed to find patterns, such as how a campaign affected customer behaviour, how effective the creative ideas were, and what consumers prefer. Marketers that utilize marketing analytics in their daily work try to take advantage of what has worked in the past and apply that knowledge to their current efforts. Marketing analytics is beneficial to both marketers and their customers. By looking at what works best in terms of conversions, brand recognition, or a mix of the two, this analysis helps marketers increase the return on their marketing expenditures. Data analytics also makes possible ads that are tailored to a consumer's unique requirements and interests, rather than the annoying bulk messages that are common today. Depending on the KPIs being monitored, a variety of approaches and models may be used to analyse marketing data. Brand awareness research, for instance, makes use of data and models that are distinct from conversion research.
● Media Mix Models (MMM): attribution models that evaluate aggregate data over a long period of time; one of the most widely used approaches in the analytics sector (Stoltz, n.d.).
● Multi-Touch Attribution (MTA): attribution models that provide individual-level data from the several points of contact along the buyer's journey (Rabinovitch, 2020).
● Unified Marketing Measurement (UMM): combines several attribution models, such as MMM and MTA, into a single set of engagement measurements (Athreya, 2021).
In today's marketing environment, accurate numbers are more crucial than ever before. In recent years, customers have become more selective about which branded materials they engage with and which ones they ignore. If a company wants to attract the attention of an ideal customer, it must use analytics to produce personalized marketing based on individual interests rather than broad demographic correlations (John, 2018). Showing the relevant ad to the appropriate consumer at the appropriate time and on the appropriate channel allows marketing teams to move customers further along the sales funnel.
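The difference between these attribution approaches can be seen in a toy example. The sketch below compares last-touch attribution with a linear (equal-credit) multi-touch rule over a few invented customer journeys; the channel names and conversion values are assumptions for illustration, not data from the chapter.

```python
# Hypothetical sketch: last-touch vs. linear multi-touch attribution for a few
# invented customer journeys. Channels and conversion values are illustrative.
from collections import defaultdict

# Each journey: (ordered list of touchpoints, conversion value in $)
journeys = [
    (["facebook_ad", "email", "paid_search"], 120.0),
    (["organic_search", "paid_search"],        80.0),
    (["email", "email", "facebook_ad"],        60.0),
]

last_touch = defaultdict(float)
linear     = defaultdict(float)

for touchpoints, value in journeys:
    last_touch[touchpoints[-1]] += value       # all credit to the final touch
    share = value / len(touchpoints)            # equal credit to every touch
    for channel in touchpoints:
        linear[channel] += share

print("Last-touch:", dict(last_touch))
print("Linear multi-touch:", dict(linear))
```

Even on this tiny example the two rules credit the channels very differently, which is why the choice of attribution model matters when budgets are reallocated.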
4.
MARKETING ANALYTICS IN ORGANIZATIONS
Marketing analytics data may help your organization make important choices regarding product updates, branding, and other aspects of your business. To prevent having a disjointed view, data from several sources (online and offline) should be used (Wedel, 2016). Your team can learn more about the following topics by analysing this data.
4.1 Intelligence about Products
Product intelligence necessitates a thorough analysis of the brand's goods and how they compare to those of its rivals on the market. Having conversations with consumers, polling target groups, and running surveys can help teams better understand their products' unique selling points. In this way, teams can better personalize their products and services to meet the particular needs and interests of each of their clients, resulting in a greater rate of conversion (Nicovich, 2007).
4.2 Customer Interests and Patterns
Analytics may tell a great deal about your clients' preferences and behaviours. What type of messaging or images do they respond to? What are their preferences? Which items do they intend to purchase, and which have they previously researched? Which advertisements generate conversions, and which ones go unnoticed by the public (Court, 2015)?
4.3 Trends in Product Development
Additionally, analytics can reveal which product characteristics clients find most appealing. Marketing teams can pass this information on to product development so that future adjustments can be made to the product (Porter, 2015).
4.4 Customer Support Analytics
Using marketing analytics, you may also discover ways to improve the customer's experience throughout the buying process. What are the specific areas in which your customers are experiencing difficulties? Is there anything you can do to make your product more user-friendly or to make the checkout process more efficient (Gruner, 2021)?
4.5 Media and Messaging
Marketers may utilize data analysis to decide where they want to show messages to certain clients based on their demographics. This is even more important considering the wide range of options presently accessible to consumers. Additionally, marketers must know which digital platforms and social media networks their target audience prefers, in addition to the conventional media such as print, television and radio. These basic questions may be answered using analytical approaches: in which medium should you invest your hard-earned dollars? Which
marketing strategies are the most effective in terms of boosting sales volume? What part of your message is getting through to your intended audience?
4.6 Competition
What is the relative effectiveness of your marketing activities in comparison to those of your competitors? What is the best way to close a gap if there is one? Are there opportunities that your rivals are taking advantage of that you aren't seeing?
4.7 Predict Future Outcomes
You'll be able to apply what you've learned about why a campaign succeeded in the past to future initiatives, for a higher return on investment.
5.
THE DIFFICULTIES OF DATA ANALYSIS
While marketing analytics is necessary for developing efficient campaigns, the analytical process is made more difficult by the large quantity of data that marketers now have access to. To achieve actionable insights, marketers must work out how to arrange data in a way that is simple to comprehend. The following are some of today's most significant marketing analytics challenges (Weathers, 2019).
5.1 Data Quantity
Large amounts of data have been available to marketing teams throughout the digital age, allowing them to trace every click, impression and view from consumers. Unless this massive amount of data can be classified and evaluated for insights that can be used to enhance campaigns in real time, it is pointless (Branca, 2020). Marketers are therefore grappling with how best to organize data so that its meaning can be assessed. One study found that even experienced data scientists spend more time preparing data than analysing it (Davenport, 2017).
5.2 Data Quality
Not only is there an issue with the large amount of data that enterprises must filter through, but this data is also frequently seen as untrustworthy. According to Forrester Consulting, poor data quality resulted in a waste of 21 per cent of respondents' media budgets (Redman, 2016). One dollar out of every five was being squandered, according to this calculation. Over the course of a year, these costs can add up to budget waste of as much as $16.5 million for medium and large businesses. To ensure that workers can make the best choices possible, organizations need a way to ensure that their data is accurate at all times.
5.3 A Shortage of Data Scientists
Even if a company has the relevant data, it often lacks the appropriate staff. Only 1.9 per cent of organizations believe they have the people necessary to make successful use of marketing data, according to the CMO Survey (Moorman, 2018).
5.4 Attribution Model Selection
Finding the proper model to deliver the relevant insights can be difficult. Multi-touch attribution and media mix modelling, for example, draw on different kinds of data – individual customer data and aggregate campaign-level data, respectively – and therefore provide quite different insights. The types of insights marketers receive will be determined by the models they use. When it comes to choosing the correct model, analysing engagement across so many platforms can be confusing (Gaur and Bharti, 2020).
5.5 Data Correlation
Additionally, when marketers acquire data from a range of sources, they must figure out how to standardize it so that all of the data can be used in the same way. Constructing meaningful comparisons between online and physical interactions is particularly challenging since they are frequently measured using different attribution models (France and Ghose, 2019). This is where unified marketing measurement and marketing analytics technology prove their usefulness, by bringing multiple data sources together.
6. HOW TO GET STARTED WITH MARKETING ANALYTICS
Here are four actions to follow at the start of your analytics programme if you want to improve your analytics capabilities.
6.1 Know What you Want to Measure Before you Start
Conversion rates, leads acquired and brand recognition are just some of the aspects of a marketing campaign that can be measured (Higa, n.d.). To get the most out of your data, make sure you know the problem you need to address or the insight you are seeking before you begin your research.
6.2 Create a Benchmark
What factors contribute to a successful campaign? Jeffery (2017) formalizes benchmarks by referring to case studies – for example, distinguishing between branded and non-branded keywords and their respective contributions to sales. This will have an impact on the sorts of data and analytics obtained by marketers in the future. There are several factors that can be used to judge whether an advertising campaign has been a success, but one of the most important is how well the brand is recognized by its customers.
6.3 Examine your Current Abilities
What is your company's current status? What are some of your stumbling blocks? Understanding these problems can assist you in making improvements to your programme, whether you are evaluating offline advertising performance or selecting the media most likely to convert (McKinsey & Company, 2020).
6.4 Install a Marketing Analytics Programme
Marketing analytics will become more crucial as customers become more selective and datasets grow in size. Marketing measurement and optimization systems may help marketers find messages that resonate and the kinds of media that lead to conversion, so that, in real time, you get a clear view of the initiatives that work and those that do not. What is such software designed to do? The answer to these challenges is marketing analytics software, which collects, organizes and correlates important data in real time, enabling real-time campaign modifications by marketers. Modern marketing tools are valuable because of their ability to analyse large volumes of data quickly. Without them, the data is so plentiful that marketers cannot comprehend all of the information at once, let alone make real-time modifications. With advanced analytics tools, by contrast, marketers can make adjustments to content and ad placement before the campaign ends, increasing their ROI (Stitch, n.d.). To further simplify data analysis, several systems now use unified marketing measurement to standardize and consolidate marketing data from many channels and campaigns. Finally, contemporary analytics technology may provide insights into brand equity and how certain audience groups react to creative elements, among other things, in addition to monitoring customer interactions. Marketers may use this information to calculate the ROI of brand development as well as how to personalize branded experiences to their customers' specific needs.
7. WHAT ARE THE FEATURES AND CAPABILITIES OF MARKETING ANALYTICS SOFTWARE?
When developing a marketing measurement system, keep the following features and capabilities in mind:
● Analytics and insights delivered in real time;
● Capabilities for brand measurement and evaluation;
● Contextualized customer and market insights;
● Annual media plan recommendations;
● Data at the level of the individual customer;
● The ability to link attribution metrics from online and offline sources.
Managers of marketing analytics should have the following skills. When it comes to conducting quality analysis that leads to more engaging and successful campaigns, marketing teams should hire analytics managers who can (Friedman, 2021):
● Conduct quality analyses: an analytics manager needs prior expertise in analysing big data sets to extract insights such as purchase behaviours and interaction trends within the target audience.
● Make recommendations for optimization: once you have gathered data insights, you will need to be able to make recommendations based on trends to enhance ineffective campaigns. A consumer's drive home is a better time to deliver an ad than their morning commute if research suggests that they are more likely to interact with branded content at night.
● Be aware of consumer and marketing technology trends: analytics managers should be on the lookout for new advancements in consumer and marketing technologies. Understanding consumer preferences for a smooth omnichannel experience, as well as how purchasers interact with augmented and virtual reality, will undoubtedly help determine the next stages for optimization potential.
● Utilize analytical tools in their work: finally, it is necessary for analytics managers to be trained in and acquainted with a wide range of automation technologies and analytics platforms. These technologies are crucial in reducing the time it takes from customer touch to consumer insight.
Stakeholders should be involved in the process. When it comes to telling a compelling narrative to stakeholders, members of the analytics team must also explain how their findings may be applied to other areas, such as sales or product development. We have all heard of the popular psychology hypothesis that divides people into left-brained and right-brained individuals. True, the left half of the brain is in charge of analytics, reasoning and language, while the right is in charge of creativity and visual comprehension. However, because of advances in neuroscience, we now know that when both sides of the brain work together, the ability to do any activity – whether creative or analytical – is strongest. The same may be said for your marketing department.
8. THE IMPORTANCE OF USING ANALYTICS AND CREATIVITY TOGETHER
Marketing teams, like the human brain, frequently have an analytical, data-driven side and a creative, visual side (Tamm et al., 2021). When both parties work together, performance and outcomes can soar to new heights. This is especially true in today’s environment, as customers are becoming increasingly adept at shutting out non-resonant marketing messaging. That’s why one of the most crucial questions to ask when creating a campaign is, “Is this message relevant to my target audience?” To ensure that the response is yes, creative teams must extensively rely on data from their analytics team and marketing analytics tools to guide message and visual components.
8.1 Creative Teams Benefit from Marketing Analytics Software
Marketers must thoroughly analyse data at a granular level and scale the insights gained from that data at machine speed to deliver the highly personalized messaging that customers desire. An advanced marketing analytics platform can be a game-changer in this situation (Tamm et al., 2021). After the analytics team has normalized, correlated and analysed the engagements from each campaign across channels for insights, the data may be handed to the creative team to inform and identify messaging. There are numerous essential approaches for creative and analytics teams to work together to gain insights into messaging and ultimately optimize campaigns.
8.2 Insights at the Individual Level
Person-level insights are the driving force behind successful marketing strategies. Correlating and scaling user-level attribution data, as well as analysing each interaction by touchpoint across channels, yields these insights. This provides advertisers with essential interaction data – such as channel, message and time – to develop personalized advertising that is targeted to specific consumers. This is particularly crucial in the creative process. On paper, two people may appear to be the same in terms of demographics, but they will react to copy and visual components differently. Person-level data goes beyond basic consumer personas, allowing creative teams to design multiple advertisements for the same product that appeal to different target consumers based on their interests and requirements.
8.3 Insights at a Glance
Creative teams can acquire real-time data updates on message performance by collaborating with the analytics team. Rather than waiting until the next campaign to maximize communications, this allows for continuous course correction (Artun and Levin, 2015). While marketing teams conduct creative pretesting to evaluate how a message will perform, it is not the same as real-world ad exposure, which is an important component of the messaging process; there are often discrepancies between what is tested and what is observed in the field. Real-time insights also mean that marketers do not lose out on time-sensitive adjustments. Consider the following example: following a campaign's launch, a marketing analytics team at a film studio examines its efficacy and discovers that none of the advertisements appealed to a certain demographic. Without the opportunity to engage this audience, the film was released, resulting in a squandered opportunity.
8.4 Interest or Population Segmentation
Advanced analytics tools also enable creative teams to work backward, producing content first or re-purposing previous work, and determining the best segment to target via heat mapping. They will then be able to offer this creative work to the right audiences on the right channels at the right time. Heat mapping, on the other hand, allows creatives to begin by evaluating the interests of a certain audience and then create content and images that speak to those interests.
8.5 Identifying Relevant, Underserved Audiences
In an ideal world, creative and analytics teams collaborate to uncover target audiences' interests and create ads tailored just for them. When analysing campaign metrics, however, the analytics may reveal audiences that are not currently being targeted. Imagine, for example, that a car manufacturer has no messaging aimed at women, even though the product has no intrinsic gender skew. Creative teams can use statistics to identify this underserved population and produce new advertising just for them.
8.6 Future Impressions
It’s easy to conceive of analytics and creativity as two distinct fields. Collaboration is necessary for either team to be successful. It’s not enough for the analytics team to use data from previous engagements to predict where and when an ad should appear, then hand it off to the creative team to create appealing messages for that medium. The best, most ideal results are achieved when both teams collaborate to design creative campaigns based on precise data that reveals what will cause a customer to stop what they’re doing and engage with the message.
9. WHERE DOES MARKETING DATA COME FROM?
Gathering and storing data is the first step before it can be utilized to monitor progress toward objectives, gain consumer insights and make strategic choices. Customer data is divided into three categories: first-party, second-party and third-party (Shewan, 2021). First-party data is collected by your organization directly and openly from the individuals who use your product or service. Second-party data is information about your customers shared by another firm (that is, its first-party data); it may come from a collaborative campaign, a partnership, or a partner whose audience has similar demographics to yours. Third-party data is information gathered, rented or sold by businesses that are unrelated to your company or consumers. Although third-party data might give information on users who are similar to yours, it does not originate from your customers or a trusted second-party source. Data from second and third parties is important, but the most reliable source is that which comes directly from your customers and reflects their actual thoughts, feelings and actions. First-party data may be gathered in a variety of ways.
9.1 Surveys
In order to learn about your customers' experiences with the product, their reasons for acquiring it, and whether or not they would suggest it to a friend, run a survey. With so many choices, the sky's the limit! Surveys may be anything from a simple pop-up to a series of multiple-choice questions on a user's experience on a website (Business Queensland, 2021).
9.2 A/B Tests are used to Compare Two Options
The purpose of an A/B test is hypothesis testing: user interactions with a modified version of your website or product are compared with interactions with the unaltered version. Consider the following scenario: if you believe that changing the colour of a button on your website from red to blue will increase the likelihood of people clicking on it, you can run an A/B experiment in which half of your consumers see the original red button (the control group) and the other half see a blue button. The purpose is not to increase traffic to your webpage but to increase efficiency. Your idea would be proven or disproved by the evidence acquired from the interactions of the two groups. A/B tests are a terrific way to put ideas to the test and collect behavioural data.
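Purely as an illustration (the visitor and click counts below are invented, and the use of the statsmodels library is our assumption rather than something prescribed by the sources cited in this chapter), a minimal Python sketch of how the red-versus-blue button results might be compared with a two-proportion z-test could look like this:

    # Hypothetical A/B test: clicks on the original (red) button versus the blue variant
    from statsmodels.stats.proportion import proportions_ztest

    clicks = [310, 355]        # conversions in the control (red) and variant (blue) groups
    visitors = [5000, 5000]    # users shown each version

    # Two-sided test of whether the two click-through rates differ
    z_stat, p_value = proportions_ztest(count=clicks, nobs=visitors)
    print(f"control CTR = {clicks[0]/visitors[0]:.3f}, variant CTR = {clicks[1]/visitors[1]:.3f}")
    print(f"z = {z_stat:.2f}, p = {p_value:.4f}")   # a small p-value suggests a real difference

A small p-value would support the hypothesis that the colour change affects click-through; otherwise the evidence is inconclusive.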
9.3 Interaction with Organic Content
Organic content such as blog posts, downloadable offers, emails and social media posts can be monitored and analysed in several ways to learn about a user's motivation for making a purchase, their stage in the marketing funnel and what they are looking for.
9.4 Interaction with Paid Advertisement
If you have paid for a digital advertisement to be shown on another website, at the top of search results, or as a sponsor of content produced by another business, you have the option of tracking the number of people who have engaged with it. This information is essential for working out where your consumers are coming from and where they are in the sales funnel when they first encounter your adverts (Treanor, 2022).
9.5 What Methods are Used to Examine Marketing Data?
Marketing data must be pooled and formatted before it can be analysed because there are so many different sorts and sources. You can do so on a variety of platforms, including:
● Google Analytics;
● HubSpot;
● Sprout Social;
● SEMrush;
● Mailchimp;
● Datorama.
Several of these platforms may be used to conduct analysis and extract crucial insights using algorithms in addition to tracking and aggregating data. You may do a manual review of the data by exporting the information into Microsoft Excel or another statistical program, making visual representations using tools for graphing and charting, and conducting regression and other analytical tests. If you are able to gather, consolidate and analyse data, you will be able to get significant insights, which you can then use to have a data-driven influence on your firm.
9.6 Make the User Experience Better
The first-party data that is obtained from your users may show how they feel about their experiences with your product and website; it is imperative that you maintain the confidentiality of this information. If your company has access to both qualitative and quantitative data, it will be able to make adjustments that allow it to better meet the requirements of its leads and increase the likelihood that those leads will become customers. These adjustments can be made regardless of whether the leads' feelings are expressed explicitly (for instance, in a survey) or implicitly in their behaviour (for example, leaving the website shortly after the page loads).
9.7 Determine the Marketing Effort's Return on Investment
Calculating the monetary benefits that can be ascribed to particular marketing channels or campaigns is another significant purpose of marketing analytics. To determine the return on investment for a given marketing campaign, use the following formula:
Return on Investment = (Net Profit / Investment Cost) × 100
Consider the case where you release a $1,000 video explaining the benefits of your product. You keep track of how many visitors go to your website's product page right after seeing the video and discover that it resulted in 30 new clients in a short period. If your product is priced at $50 and each new lead purchases one, the video is responsible for $1,500 in sales, so the net profit is $500. Plugging this into the ROI formula gives:
ROI = ($500 / $1,000) × 100 = 50 per cent
When the ROI is positive, the marketing effort – in this example, the video – is considered profitable. It would be hard to calculate the financial effect of such operations without first gathering information about where the leads are coming from. ROI calculations can be used to assess which marketing initiatives result in the most sales and to demonstrate the value of projects.
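As a small, purely illustrative sketch (the helper function below is ours and not part of any analytics package mentioned here), the same calculation can be expressed in Python using the figures from the example above:

    # Minimal ROI helper based on the formula above: (net profit / investment cost) x 100
    def roi_percent(revenue: float, investment_cost: float) -> float:
        net_profit = revenue - investment_cost
        return net_profit / investment_cost * 100

    revenue = 30 * 50                    # 30 attributed sales at $50 each = $1,500
    print(roi_percent(revenue, 1000))    # -> 50.0 (per cent)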
9.8 Make Marketing Plans for the Future
By providing you with a deeper comprehension of your clientele and the ability to monitor the ROI of your various marketing endeavours, marketing analytics makes it possible for your business to develop plans that are driven by data. You can improve your company by using marketing data to find out what is working and what is not, as well as how your customers feel about their encounters with your product and website. You also get a comprehensive view of the influence of your marketing operations on your company's bottom line, which you can use to make plans for the future. What should you do more of to meet your quantitative objectives? Which endeavour did not generate new leads and should be eliminated from future plans? These types of questions can be answered with the help of data analytics.
9.9 Developing Your Analytical Marketing Skills
Being able to use data to make strategic decisions while also feeling confident in your marketing efforts is a win–win situation. To hone your analytical abilities, consider other people's perspectives, engage regularly in activities like brain teasers and strategy games, or work with data. It might also be something more formal and purposeful, such as enrolling in an online analytics course, with the goal of advancing your career by increasing your expertise and your network of people working in the analytics industry.
10. CONCLUSIONS
To summarize, a great deal has changed since the 1960s and 1970s, when the first marketing visualization and mapping research studies were carried out. Over this period, computers have become more powerful, databases have become more complicated, and data analysis methodologies have advanced in sophistication. While certain research themes have lasted throughout this time, there are also significant differences between studies. When it comes to visualization, there will always be a fusion of art and science to achieve success. While quantitative tools can assist in the understanding of data, managerial acumen is still necessary in order to make business decisions based on visualizations of information. Visualization storytelling, a technique in which visuals are pieced together to convey a tale, has an aspect of art to it in the way the images are assembled. All parts of marketing entail a trade-off between art and science in order to be successful. It was pointed out as early as the 1960s that, although marketing researchers had produced scientific theory, its execution would require "art" based on practitioner experience – a point marketing scholars were slow to recognize. This has resulted in an ongoing debate over the last 30–40 years about the tension and trade-off between art and science in the marketing field (Brown, 1996). In this way, marketing practice is characterized by the use of visual representations.
REFERENCES Artun, O. and Levin, D. (2015). Predictive Marketing: Easy Ways Every Marketer Can Use Customer Analytics. Hoboken, NJ: Wiley. Retrieved from https://books.google.com.au/books?hl=en&lr=&id =yfk-CgAAQBAJ&oi=fnd&pg=PR9&dq=Creative+teams+can+acquire+real-time+data+updates+ on+message+performance+by+collaborating+with+the+analytics+team.+Rather+than+waiting+ until+the+next+campaign+to+maximize+communi. Athreya, G. (2021). Case 10. Unified Marketing Measurement – a marketing effectiveness story, 17 March. Retrieved from Linkedin: https://www.linkedin.com/pulse/case-10-unified-marketing -measurement-big-data-story-guha-athreya. Branca, M. (2020). The influence of big data in digital marketing. OVIOND. Retrieved from https://www .oviond.com/the-influence-of-big-data-in-digital-marketing. Brown, S., (1996). Art or science?: Fifty years of marketing debate. Journal of Marketing Management, 2, 243–7. Business Queensland (n.d.). Planning and conducting market and customer research. Queensland Government. Retrieved from Business Queensland: https://www.business.qld.gov.au/starting -business/planning/market-customer-research/market-research/methods. Chai, W., Labbe, M. and Stedman, C. (n.d.). Big data analytics. Retrieved from https://www.techtarget .com/searchbusinessanalytics/definition/big-data-analytics.
70 Handbook of big data research methods Charles G. Jobs, D. M. Jobs, C.G., Gilfoil, D.M. and Aukers, S.M. (2015). How marketing organizations can benefit from big data advertising analytics. Academy of Marketing Studies Journal, 20(1). Retrieved from https://www.abacademies.org/articles/amsjvol20no12016.pdf. Chhavi Arora, T. C. (2020). The Recovery Will Be Digital. Raju Narisetti Publisher McKinsey Global Publishing. Retrieved from https://www.mckinsey.com/~/media/mckinsey/business%20functions/ mckinsey%20digital/our%20insights/how%20six%20companies%20are%20using%20technology %20and%20data%20to%20transform%20themselves/the-next-normal-the-recovery-will-be-digital .pdf. Court, D. (2015). Getting big impact from Big Data. January. Retrieved from https://www.mckinsey .com/~/media/McKinsey/Business%20Functions/Marketing%20and%20Sales/Our%20Insights/ EBook%20Big%20data%20analytics%20and%20the%20future%20of%20marketing%20sales/Big -Data-eBook.ashx. Davenport, T.H. (2017). What’s your data strategy? Harvard Business Review. Retrieved from https:// hbr.org/2017/05/whats-your-data-strategy. Dixon, M., Freeman, K. and Toman, N. (2010). Stop trying to delight your customers. Harvard Business Review, August. Retrieved from https://hbr.org/2010/07/stop-trying-to-delight-your-customers. France, S.L. and Ghose, S. (2019). Marketing analytics: Methods, practice, implementation, and links to other fields. Science Direct, 119, 456–75. Retrieved from https://doi.org/10.1016/j.eswa.2018.11.002. Friedman, H. (2021). Top 16 marketing analytics tools and software for 2022. Improvado. Retrieved from https://improvado.io/blog/marketing-analytics-tools. Gaur, J. and Bharti, K.D. (2020). Attribution modelling in marketing. Academy of Marketing Studies Journal, 24(4). Retrieved from https://www.researchgate.net/profile/Kumkum-Bharti-2/publication/ 346031333_ATTRIBUTION_MODELLING_IN_MARKETING_LITERATURE_REVIEW_AND _RESEARCH_AGENDA/links/5fb7568e92851c933f42b56e/ATTRIBUTION-MODELLING-IN -MARKETING-LITERATURE-REVIEW-AND-RESEARCH-AGENDA.pdf. Gruner, R.L. (2021). 4 strategies to simplify the customer journey. Harvard Business Review, 14 May. Retrieved from https://hbr.org/2021/05/4-strategies-to-simplify-the-customer-journey. Higa, H. (n.d.). The 9 goals to consider when creating a marketing strategy. Retrieved from https://blog .hubspot.com/marketing/goals-of-marketing. Jeffery, M. (2017). Data-Driven Marketing: The 15 Metrics Everyone in Marketing Should Know. John Wiley & Sons. Retrieved from https://books.google.com.au/books?hl=en&lr=&id=rxc4DwAAQBAJ &oi=fnd&pg=PA13&dq=What+constitutes+an+effective+campaign%3F+This+will+influence+the+ types+of+data+and+metrics+gathered+by+marketers.+If+the+goal+is+to+raise+brand+awareness,+ the+success+metric+co. John, T.K. and Barasz, K. (2018). Ads that don’t overstep. Harvard Business Review, January–February. Retrieved from https://hbr.org/2018/01/ads-that-dont-overstep. Klompmaker, J.E., Hughes, G.D. and Haley, R.I. (1976). Test marketing in new product development. Harvard Business Review. Retrieved from https://hbr.org/1976/05/test-marketing-in-new-product -development. Mailchimp (n.d.). The Mailchimp marketing glossary. Marketing Analytics. Retrieved from https:// mailchimp.com/marketing-glossary/#markeSting-analytics. Marketing Evolution (n.d.). What is marketing analytics? Retrieved from https://www.marketingevolution .com/marketing-essentials/marketing-analytics. McCormick, K. (2022). How much does Google ads cost in 2022? WordStream, 25 February. 
Retrieved from https://www.wordstream.com/blog/ws/2015/05/21/how-much-does-adwords-cost. McKinsey & Company (2020). The Next Normal: The Recovery will be Digital. McKinsey Global Publishing. Retrieved from https://www.mckinsey.com/~/media/mckinsey/business%20functions/ mckinsey%20digital/our%20insights/how%20six%20companies%20are%20using%20technology %20and%20data%20to%20transform%20themselves/the-next-normal-the-recovery-will-be-digital .pdf. Moorman, C.F. (2018). Why marketing analytics hasn’t lived up to its promise. Harvard Business Review, 30 May. Retrieved from https://hbr.org/2018/05/why-marketing-analytics-hasnt-lived-up-to -its-promise. Nichols, W. (2013). Advertising analytics. Harvard Business Review.
The benefits of marketing analytics and challenges 71 Nicovich, S.G., Dibreli, C. and Davis, P.S. (2007). Integration of value chain position and Porter’s (1980) competitive strategies into the market orientation conversation: An examination of upstream and downstream activities. Journal of Business and Economic Studies, 13(2). Retrieved from https://d1wqtxts1xzle7.cloudfront.net/37437158/Valve_Chian_and_Poter-with-cover-page-v2 .pdf?Expires=1646367717&Signature=Mq62RUdajryTbdKsagFy-cakrSBhocdqXHL9jO04uIQV ~mkP7xHd2CVVL0k4Uops4FV8qXnQpuQRA26aPAOuVfGwKG~6B3mUUUcpHXtuBWL1 LNAx7fo7mb-3JbhjEQ10j40Tw-. Porter, M.E. and Heppelmann, J.E. (2015). How smart, connected products are transforming companies. Harvard Business Review, October. Retrieved from http://www.knowledgesol.com/uploads/2/4/3/9/ 24393270/hbr-how-smart-connected-products-are-transforming-companies.pdf. Rabinovitch, J. (2020). Multi-touch attribution: An exploration and visual essay in marketing analytics. Thesis. Retrieved from https://diginole.lib.fsu.edu/islandora/object/fsu:746628/datastream/PDF/view. Redman, T.C. (2016). Bad data costs the U.S. $3 trillion per year. Harvard Business Review, 22 September. Retrieved from https://hbr.org/2016/09/bad-data-costs-the-u-s-3-trillion-per-year. SAS (n.d.). Marketing analytics: What it is and why it matters. Retrieved from https://www.sas.com/ en_au/insights/marketing/marketing-analytics.html. Shaw, J. (2014). Why “Big Data” Is a Big Deal. Harvard Magazine. Retrieved from: https://www .harvardmagazine.com/2014/03/why-big-data-is-a-big-deal. Shewan, D. (2021). 7 sources for marketing data to make your content more convincing. WordStream, December. Retrieved from https://www.wordstream.com/blog/ws/2015/07/14/marketing-data. Spenner, P. and Freeman, K. (2012). To keep your customers, keep it simple. Harvard Business Review. Retrieved from https://hbr.org/2012/05/to-keep-your-customers-keep-it-simple. Stitch (n.d.). Marketing analytics: Definition and uses. Retrieved from https://www.stitchdata.com/ resources/marketing-analytics/#:~:text=Marketing%20analytics%20tools%20improve%20lead,into %20customer%20behavior%20and%20preferences. Stoltz, N. (n.d.). What is Marketing/Media Mix Modeling (MMM)? Measured. Retrieved from https:// www.measured.com/faq/what-is-marketing-media-mix-modeling-mmm. Tamm, T., Hallikainen, P. and Tim, Y. (2021). Creative analytics: Towards data-inspired creative decisions. Information Systems Journal. Retrieved from https://onlinelibrary.wiley.com/doi/abs/10.1111/ isj.12369. Treanor, T. (2022). 12 essential data sources to understand the customer journey. Treasure Data, March. Retrieved from https://blog.treasuredata.com/blog/2019/03/06/12-essential-data-sources-to -understand-the-customer-journey/. Weathers, D. and Aragon, O. (2019). Integrating analytics into marketing curricula: Challenges and effective practices for developing six critical competencies. Marketing Education Review, 266–82. Retrieved from https://www.tandfonline.com/doi/full/10.1080/10528008.2019.1673664?scroll=top& needAccess=true. Wedel, M. and Kannan, P.K. (2016). Marketing analytics for data-rich environments. Journal of Marketing, November. Retrieved from https://journals.sagepub.com/doi/abs/10.1509/jm.15.0413. Whitler, K.A. (2015). The benefits of big data and using marketing analytics to analyze it. Forbes, 19 July. Retrieved from https://www.forbes.com/sites/kimberlywhitler/2015/07/19/how-some-of-the -best-marketers-are-using-analytics-to-create-a-competitive-advantage/?sh=5131895d45c5. 
Williams, G.A. and Miller, R.B. (2020). Change the way you persuade. Harvard Business Review. Retrieved from https://hbr.org/2002/05/change-the-way-you-persuade.
5. How big data analytics will transform the future of fashion retailing
Niloofar Ahmadzadeh Kandi
1. INTRODUCTION
Businesses, particularly fashion companies, are becoming more digital due to the advent of the Internet, mobile devices, social networks and related technologies, and they can easily access the data generated by these sources, such as click-through rates, customer feedback on social media, search history and so forth. Big data analytics can be employed for diverse purposes, including sales forecasting, identifying patterns and trends, product personalization services, optimization and automation, marketing trends and more. Moreover, due to the immense competition in this industry and the existence of abundant data from various sources, the fashion life cycle has become relatively short, and customer demand is constantly changing; organizations therefore need to update their strategies based on market demand so that they can ensure their survival in conditions of uncertainty (Thomassey and Zeng, 2018). Thus, with the evolution of data, big data analytics has emerged as a new method of data analysis and become more popular in fashion retailing in recent decades, providing intelligent and sophisticated data analysis (Silva et al., 2019) based on the 4Vs – volume, velocity, variety and value – through which organizations gain insights from data (Acharya et al., 2018). Most research on big data in the retail and fashion industries has focused on customer insights to forecast demand and on how to implement marketing activities more effectively. Although accurate marketing can generate demand, the role of operational support in turning demand into sales cannot be ignored (Ma and Sun, 2020). Therefore, in this study, we examine the impact of big data on the future of fashion retailing, and the main question of this research is: how can the use of big data transform the future of fashion retailing from a marketing perspective? To answer this question, we first identify the potential applications of big data in the fashion industry through a literature review, and then present a framework for the application of big data from a marketing perspective through the Big Data Value Chain (BDVC) in the fashion industry.
2. WHY BIG DATA ANALYTICS
Different statistical approaches, such as time series analysis and regression analysis, have been widely used in the fashion industry so far to predict sales and to support apparel and textile production decisions (Jelil, 2018); they are well-explored approaches for analysing structured data and linear relationships between variables (Xue et al., 2018). But the data created in today's world is of a different nature: obtained from different methods and sources such as text, images, video, customer reviews, social media comments and so on, it is structured, semi-structured and unstructured. Consequently, the use of classical tools to interpret semi-structured and unstructured data is very challenging (Tan et al., 2015), since the nonlinear nature of the variables prevents traditional tools from providing valuable information. An important point about classical methods is that they impose prerequisites on the distribution and size of the dataset – for example, when the dataset is too small to predict sales of a new product. Traditional approaches cannot be used effectively for analysis when the dataset is too large or there is a lack of historical sales data, and therefore prediction becomes challenging (Tan et al., 2015). Also, the integration of data for analysis, which used to be performed at specific time intervals such as weekly or monthly, can no longer meet the need for real-time data insights. Thus, if an organization cannot use the high volume of data generated in a real-time manner, part of its value will be lost (Liu et al., 2014). As data analytics techniques have advanced rapidly, the management of data-driven organizations is undergoing an unprecedented evolution. However, most fashion industry organizations do not utilize the new approaches to achieve success (Acharya et al., 2018). Even though advanced data analysis tools are available as open source, they still pose challenges for organizations because of the advanced technology skills required (Akter et al., 2016). In the fashion industry, big data analytics can solve the problem of predicting customer behaviour, which leads to determining apparel trends and identifying the right product assortment, eventually leading to effective allocation of marketing resources. In fact, using big data in marketing analytics is incredibly important for gaining insight into customer behaviours, gauging consumers' tastes, and satisfying their demand through shorter lead times.
3. LITERATURE REVIEW
The volume of data stored in public and private cloud infrastructure is expected to exceed 200 zettabytes (ZB) by 2025, having already grown from 4.4 ZB in 2019 to 44 ZB in 2020 (Morgan, 2020). According to Google statistics, more than 40 000 Google searches occur per second, which is approximately equal to 227 million searches per hour and 3.5 billion per day,1 indicating the high velocity of big data. Giant fashion retailing companies such as GAP collect their data using various methods – their own websites, Google Analytics, social media platforms, customer feedback and comments, Google Trends, browsing history, customer IP addresses and click-through rates – to extract insights for fashion trends (Israeli and Avery, 2018); together these make up big data in fashion. However, big data analytics requires the use of up-to-date technologies and organizational flexibility to optimize decisions (Acharya et al., 2018). Among the methods that offer much more accurate and efficient approaches than statistical analysis are techniques based on artificial intelligence (AI), which have been introduced with the development of computer technologies (Jelil, 2018; Ren et al., 2018). AI-based models can analyse any type of structured, semi-structured and unstructured data, can interpret nonlinear problems effectively, and do not have the strict preconditions about the size and distribution of the dataset that are of special importance for meaningful interpretation in classical statistical methods (Liu et al., 2014; Xue et al., 2018).
1. See https://www.internetlivestats.com/google-search-statistics/.
Machine learning, genetic algorithms and fuzzy logic are some branches of artificial intelligence (Thomassey and Zeng, 2018). Machine learning is the process of training a computer system to make accurate predictions by implementing and applying what it has learned during training on datasets, with minimal human intervention (Ma and Sun, 2020). An artificial neural network is a subset of machine learning composed of a set of algorithms that mimic the way the human brain operates in order to recognize relationships within datasets (Goodfellow et al., 2016); such machines are able to learn from input data through mathematical theories of learning (Guo et al., 2011). Deep learning is also a subset of machine learning and learns from very large data in artificial neural networks whose structure is formed by simulating the human brain. Similar to the way we learn through experience, deep learning algorithms repeat a task over and over again to improve their output (Goodfellow et al., 2016). The application of such advanced analytics tools in the fashion industry increases the performance of various areas such as fabric and fashion design (Gu et al., 2020), operational processes (Janssen et al., 2017), inventory management (Hofmann and Rutschmann, 2018), retailing, marketing, finance, customer targeting (Dekimpe, 2020), and so on. Customer segmentation is one of the streams for which a variety of big data analytical methods can be used, for instance analysing customer behaviours, using sales data to segment customers based on their Customer Lifetime Value (CLV), and clustering customers based on their RFM (Recency, Frequency, Monetary) patterns and transactional datasets. Wong and Wei (2018) conducted market segmentation using the RFM model, along with a regression model, to implement data mining that combined competitor pricing, customer segmentation and predictive analytics. Miralles-Pechuán et al. (2018) classified the online shopping market using genetic algorithms. Griva et al. (2018) suggested a hybrid approach to analyse customer behaviour based on clustering and association rules. Tavakoli et al. (2018) analysed consumer patterns by combining RFM with K-means clustering. From another big data analytics perspective, Aluri et al. (2019) conducted a study that dynamically engaged customers with a loyalty programme brand to determine the value of customers in the hospitality industry. Moreover, Ahmad et al. (2019) developed a churn prediction model that allows telecom operators to predict which customers may be subject to churn; using machine learning techniques on big data platforms, the model creates a new approach to engineering and selecting features (Ahmad et al., 2019). All of these studies have highlighted the importance of big data in complex business areas and how data-driven decision-making can help organizations by replacing traditional methods.
4. RESEARCH APPROACH
Using a systematic literature review, we synthesized the current knowledge in the field of big data analytics and machine learning techniques used in fashion retailing. In this chapter, we have tried to base our search strategy on the two principles of sensitivity and specificity. In developing a sensitive search strategy, terms must be searched both as free-text keywords and as thesaurus terms in the title and abstract fields (Bramer et al., 2018); sensitivity thus refers to retrieving all the items that are related to the search. However, it is also very important that the articles found have specificity, which is the ability to retrieve a more manageable number of relevant citations (Atkinson and Cipriani, 2018).
Because big data techniques sit at the intersection of business and computer science, all related articles on the application of big data analytics and machine learning techniques in fashion retail in the fields of business and management, computer science and decision science were examined in order to answer the research question of how big data will transform the future of fashion retailing. The findings of the chapter examine the areas in which common computer science techniques can have the greatest impact on fashion business and retail. Techniques such as machine learning algorithms applied to dynamic pricing, customer segmentation, automation and personalization are among the most widely used applications of big data in computer science, and they have wide-ranging effects on business management, including marketing, process operations and organizational strategy. The study formed search strings that combined "big data analytics" with a variety of terms and phrases to identify relevant publications. Boolean operators (AND, OR) and wildcard symbols (*) were used to reduce the number of search strings. For example, the search string "big data" could return hits for "big data analytics" and "big data techniques". In a thorough database search, the terms "big data analytics" were combined with "marketing analytics" and "big data applications". Additionally, further search strings focused on machine learning techniques were constructed to understand how machine learning can accelerate the use of big data in fashion retailing. In an initial search, the keywords "big data" and "retailing" were considered; "retailing" was used instead of "fashion retailing" in order to encompass the maximum number of articles in this field. In addition, the study combined the search terms "big data" and "machine learning" with "retailing". Using the field code TITLE-ABS-KEY instead of DOCTITLE() alone provides more specificity in the database search. Four databases – Scopus, Web of Knowledge, Emerald and Science Direct – were investigated. As sensitivity and specificity were considered simultaneously, this targeted search returned 84 articles in the first phase; the second search identified nine specific articles in this research area.
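To make the construction of these search strings concrete, the short Python sketch below prints a few Scopus-style query strings of the kind described above; the exact strings are illustrative reconstructions, not the literal queries used in the study:

    # Illustrative Scopus-style search strings combining the keywords described above
    queries = [
        'TITLE-ABS-KEY("big data" AND "retailing")',
        'TITLE-ABS-KEY("big data" AND "machine learning" AND "retailing")',
        'TITLE-ABS-KEY("big data analytic*" AND ("marketing analytics" OR "big data application*"))',
    ]
    for q in queries:
        print(q)   # each string uses Boolean operators and wildcards to widen or narrow the search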
5. FINDINGS
After finding and studying relevant articles on big data and machine learning techniques in the retail and fashion industry, it became clear that each study followed different steps to collect and use data to gain practical insights from big data. What is apparent is that there is no single way to differentiate analytics techniques. In this study, drawing on previous research and in order to make the best use of big data in the fashion industry, a four-step value chain that includes data collection, data integration and storage, data analytics and data exposition is examined. We have therefore tried to classify and differentiate the analytics models applicable in the fashion industry into four groups based on the big data value chain (Figure 5.1).
6. BIG DATA VALUE CHAIN MODEL
The big data value chain approach helps fashion retailers to consider the characteristics of big data to gain actionable insights from the data, and organizations can benefit from digital
data at different stages of the value chain, from the time of data creation to its destruction in the different phases of the chain (Faroukhi et al., 2020). Thus, the big data value chain approach divides the data analysis process into different phases for value extraction. In the following, the different parts of the big data value chain (BDVC) in fashion retailing, from data collection to exposition, and the techniques that can be used at each stage, are described (Figure 5.1).
Figure 5.1 Big data value chain
6.1 Data Collection
The first step in the big data value chain is how data is created and collected. Data can be generated by humans, sensors and systems, both actively and passively, from internal or external sources (Adnan and Akbar, 2019). Also, with the advent of social media – Instagram, Twitter, bloggers and so on – a rich collection of insight-bearing data can be obtained, to which sentiment analysis is usually applied to gain insights from public opinion (Saggi and Jain, 2018). Data is also generated in various heterogeneous formats: structured, semi-structured and unstructured. Historical data from retailers and customers' purchase histories, IP addresses and geolocation are considered structured data (Israeli and Avery, 2018). Online data such as texts, tweets and emails are examples of semi-structured data, while images, audio and video are unstructured data, which are challenging formats (Kambatla et al., 2014). When customers browse the Internet, some advertisements are targeted by machine learning algorithms through personalized web pages, often referred to as website morphing algorithms (Ma and Sun, 2020), which draw customers towards particular information. When the customer then visits the target store's website and makes a purchase, machine learning algorithms extract the customer's preferences. In addition, artificial intelligence systems such as Amazon Alexa (through voice recognition), chatbots (through language-processing algorithms) and augmented reality (by capturing images) collect data that leads to customer engagement and access to big data (Sung, 2021).
6.2 Data Integration and Warehousing
Raw data collected from various sources may be noisy and redundant, and it can lead to meaningless results in the analysis phase or affect the quality of other data. Therefore, at this stage, the following five steps should be carried out to improve the reliability of the data:
● Data cleaning, which includes removing missing values and inefficient data from the dataset (Rehman et al., 2016);
● Data reduction, which involves reducing dimensions, using methods such as feature selection, in order to limit big datasets to the features that are substantially relevant for big data analytics (Rehman et al., 2016);
● Data transformation, which changes raw data into a format suitable for further analysis in later stages (García et al., 2016);
● Data integration, which involves combining data views to create a single view of data scattered across different locations (Wang, 2017);
● Data discretization, which involves converting continuous data values into a finite set of intervals (Ramírez-Gallego et al., 2015).
Pre-processed raw data from various sources is then sent to the big data storage infrastructure for the next phase of data analysis. NoSQL databases are commonly used to store large, unstructured data (Storey and Song, 2017). Machine learning methods make it possible to integrate hybrid data. This feature is very important because customers are exposed to different types of data: for example, an advertisement for a product placed on a website contains text and image data, and a potential customer watching a promotional video may comment on social media. A deep understanding of these customer preferences and the combination of different types of data can reveal many patterns of behaviour and opportunities (Hartmann et al., 2019).
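As a minimal illustration of the cleaning, reduction, transformation and discretization steps listed above (the file name, column names and the use of pandas and scikit-learn are assumptions for the example, not drawn from any cited study), the pre-processing stage might look like this in Python:

    # Hypothetical retail dataset: clean, reduce, transform and discretize raw records
    import pandas as pd
    from sklearn.preprocessing import StandardScaler, KBinsDiscretizer

    raw = pd.read_csv("transactions.csv")               # assumed file of raw transaction records

    # Data cleaning: drop records with missing values and duplicates
    clean = raw.dropna().drop_duplicates()

    # Data reduction: keep only the features relevant to the analysis
    reduced = clean[["customer_id", "basket_value", "visits_per_month"]].copy()

    # Data transformation: scale numeric features to a comparable range
    numeric_cols = ["basket_value", "visits_per_month"]
    reduced[numeric_cols] = StandardScaler().fit_transform(reduced[numeric_cols])

    # Data discretization: convert continuous spend into a finite set of intervals
    bins = KBinsDiscretizer(n_bins=4, encode="ordinal", strategy="quantile")
    reduced["spend_band"] = bins.fit_transform(reduced[["basket_value"]]).ravel()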
6.3 Data Analytics
The purpose of big data analytics is to analyse organized and stored data to extract insights and answer complex business questions (Dorschel, 2015). There are three main categories of data analytics: descriptive, predictive and prescriptive. Descriptive analytics is the most common type, used by retailers to tell a story about their data; it involves analysing descriptive statistics for a sample rather than estimating the true population value. Descriptive statistics include measures of central tendency such as the mean, mode and median, together with the standard deviation, the shape of the data and its skewness. Graphs, charts and lines are typically used to display results. A company captures and stores information about products, customers and sales in structured relational (SQL) databases, for instance. These data can then be retrieved, transformed and stored in data warehouses periodically. Managers can then examine the correlations between all of the possible attributes; they can, for example, identify which products are popular among customers of a given gender and age. Analysing data in this manner is useful for planning shelves, recommending products and offering discounts (Duan and Xiong, 2015). Analysing data from multiple sources gathered in the previous phases of the BDVC generates invaluable insight into the past and present performance of a company. Data is categorized, characterized and classified to provide useful information that helps analyse and understand business performance. Data about budgets, sales, revenue or cost can be presented in descriptive
analytics as charts and reports (Grover et al., 2018). Managers can obtain standard and customized reports, drill down into the data and create queries in order to evaluate the effectiveness of an advertisement, identify problem areas and opportunities, and identify trends in the data. Classification, clustering and association rules are examples of descriptive analytics techniques (Gandomi and Haider, 2015) that explore historical data and present trends, patterns and insights from the data in a comprehensible way. Indicators such as churn rate, conversion rate, adoption rate and click-through rate, which can be obtained through Google services and companies' own websites, are examples of descriptive analytics and reflect customers' purchase journeys. Descriptive analytics allows a company to examine how tasks were carried out and which measures generate more revenue, and it rests on two main phases: data aggregation and data mining (Deshpande et al., 2019). Techniques that can be used for descriptive analysis include cluster analysis (Jun et al., 2014) and association rules (Saggi and Jain, 2018), which help in dealing with big data. From a marketing perspective, the goals of descriptive analytics include examining actual and target audiences, gaining insight into customer behaviour patterns, examining overall product demand and analysing demand within a cluster or segment, estimating the effectiveness of marketing campaigns, evaluating costs, and comparing indicators with each other at different times. KPI governance is one of the most significant factors in the success of descriptive analytics, and well-set goals are the main pillars of greater effectiveness in predictive and prescriptive analysis.
Predictive analytics examines historical results in order to predict future outcomes. Data is analysed by finding patterns, detecting relationships among these patterns, and projecting these relationships into the future. For example, marketers can predict the reactions of customers to advertising campaigns, commodities traders can predict short-term fluctuations in commodity prices, and ski-wear manufacturers can predict demand for ski-wear of a particular size and colour next season. Through predictive analytics, it is possible to identify risks and relationships that are not apparent through traditional methods. Advanced predictive analytics techniques can be used to uncover hidden patterns within volumes of data that are then segmented into coherent sets so that trends can be revealed and behaviours predicted. Some common predictive techniques are decision trees (Duan and Xiong, 2015), regression (Ma and Sun, 2020), Naïve Bayes (Allenby et al., 2014), Support Vector Machines (SVM) (Huang and Luo, 2016) and neural networks (Liu and Toubia, 2018). Moreover, deep neural networks (Kraus et al., 2019) are one of the most powerful data processing methods for structuring data from different sources to facilitate input for computer modelling.
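To make the predictive step concrete, the sketch below trains one of the techniques just listed – a decision tree classifier – on synthetic data; the features, labels and figures are invented purely for illustration and do not come from any study cited in this chapter:

    # Illustrative predictive analytics: a decision tree predicting repeat purchase
    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    rng = np.random.default_rng(0)
    X = rng.random((500, 3))                      # hypothetical recency, frequency and spend features
    y = (X[:, 1] + X[:, 2] > 1.0).astype(int)     # synthetic label: will the customer buy again?

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
    model = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)
    print("hold-out accuracy:", accuracy_score(y_test, model.predict(X_test)))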
Prescriptive analytics identifies the best alternatives for minimizing or maximizing some objective through optimization. Its use extends to a number of areas of business, such as marketing, finance and operations; in marketing, for example, it is used to maximize revenue and to determine the best pricing and advertisement strategy. The combination of predictive analytics and optimization is an effective way to make data-driven decisions under uncertainty (Evans and Linder, 2012). Prescriptive analytics is an application of applied learning that utilizes data from descriptive and predictive analytics to solve real-world problems. This emerging field involves advanced optimization, simulation, game theory and decision analysis methods (Hofmann and Rutschmann, 2018). A prescriptive analytics approach helps answer questions such as: What is the optimal production level to maximize profits? How can we minimize the cost of delivering goods from factories? Probabilistic
models, machine learning, mathematical programming and logic-based models are some of the methods that are extensively applied in prescriptive analytics (Lepenioti et al., 2020).
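As a purely illustrative sketch of this prescriptive step (the garments, profits and resource limits below are hypothetical), a simple linear programme can choose production quantities that maximize profit subject to fabric and labour constraints:

    # Illustrative prescriptive analytics: choose production levels that maximize profit
    from scipy.optimize import linprog

    profit = [-40, -30]            # profit per unit of garments A and B (negated: linprog minimizes)

    usage = [[2.0, 1.5],           # fabric (metres) used per unit of A and B
             [1.0, 2.0]]           # labour (hours) used per unit of A and B
    limits = [1200, 900]           # fabric and labour available

    result = linprog(c=profit, A_ub=usage, b_ub=limits, bounds=[(0, None), (0, None)])
    print("optimal quantities:", result.x)     # units of each garment to produce
    print("maximum profit:", -result.fun)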
6.4 Data Exposition and Modelling
The final phase includes sharing and visualizing the data, insights and knowledge gained over the previous phases. Learning and decision-making begin at this point, based on the data analysis algorithms obtained at the previous stage, and AI systems must be used here. Continuous decisions are made promptly by machine learning-based models that learn from the results of predictive analytics, as well as through AI systems and AI-based decision-making. Automation plays a vital role in the fashion industry because of the shortened life cycle of data (Ren et al., 2018). AI-based decisions can be automated through systems such as chatbots and robots, which can assist with customer service and warehouse automation, respectively (Shankar, 2019).
7. GENERAL DISCUSSION: FUTURE OF BIG DATA IN FASHION RETAILING
As examined in the big data value chain steps, each step is of particular importance in preparing the data for analysis and insight extraction in fashion retailing. The complex nature of big data concepts in business, however, presents various challenges to researchers. With the increasing complexity in modelling, researchers are also turning to machine learning solutions as a valuable alternative to traditional statistical and econometric models. SVM, decision trees, regression, Naïve Bayes, and deep neural networks are among the machine learning approaches that have been employed in academic business research for prediction and insight extraction, as noted in BDVA. Additionally, they are frequently used in situations when standard quantitative approaches are ineffective, making them particularly useful. However, despite the increased interest, the application of machine learning methods in marketing and business is still in its early stages, and existing research is fairly fragmented. Up to now, there does not appear to be a coherent framework for using machine learning approaches in business research. Machine learning approaches, as opposed to traditional statistical and econometric models used in business research, can successfully analyse unstructured large-scale data resulting in good prediction performance. Meanwhile, machine learning approaches are typically difficult to comprehend, and their capacity to capture the heterogeneity of customers and dynamics has yet to be demonstrated. The fashion industry has been and will continue to be significantly transformed by big data and machine learning. Scalable and intelligent algorithms are increasingly powering marketing practices such as personalization and targeting, optimization and automation, customers’ online and in-store shopping journey and so forth (Ma and Sun, 2020). Several key developments in the fashion industry have been fuelled by big data and machine learning, and some of these trends are briefly addressed in the following section. 7.1
Customer Segmentation and Discovery through Clustering
Using AI, companies can better understand their customer segments, allowing marketing managers to target and position their campaigns better to achieve maximum results. This approach
allows marketers to identify consumer segments based on certain parameters, which enables more precise marketing messages to be targeted efficiently and products and brands to be tailored to each segment. With AI, marketers are able to predict customer intent and segment customers based on their preferences. Due to the massive heterogeneity of consumer tastes, segmentation offers enormous potential, from targeting promotions and ads to improving product recommendations. Clustering algorithms are used to identify relationships between observed data points even when there are no explicit signs of association, and clustering can also be used to develop rules for categorizing future data (Campbell et al., 2020). To inform store offerings, Nike uses geolocation and behavioural data collected through its app, together with clustering algorithms, to help it determine which items to display together (McDowell, 2019).
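As a concrete illustration of this kind of clustering-based segmentation, the sketch below groups customers with k-means on a few behavioural features. The features, the choice of four clusters and the data are assumptions made purely for the example and do not describe Nike's, or any retailer's, actual pipeline.

```python
# Minimal customer-segmentation sketch with k-means (hypothetical features).
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
customers = pd.DataFrame({
    "annual_spend": rng.gamma(2.0, 300.0, 500),       # currency units
    "visits_per_month": rng.poisson(4, 500),
    "avg_basket_size": rng.gamma(2.0, 20.0, 500),
    "days_since_last_purchase": rng.integers(1, 365, 500),
})

# Standardize so no single feature dominates the distance metric.
X = StandardScaler().fit_transform(customers)

# Four segments is an assumption; in practice k is tuned (elbow/silhouette).
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
customers["segment"] = kmeans.labels_

# Profile each segment to inform targeting and positioning decisions.
print(customers.groupby("segment").mean().round(1))
```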
7.2 Dynamic Pricing through Regression Models
A company's pricing strategy is central to maximizing sales. It involves determining the price to charge for products and services, developing an understanding of consumer price sensitivity, and mapping competitor pricing. A wide range of AI applications can assist, including estimating consumer price elasticity, enabling dynamic pricing, and detecting pricing errors, pricing anomalies, fraud and non-profitable consumers (Campbell et al., 2020). Using machine learning regression techniques, marketers can estimate numerical values from pre-existing information, which can then be used to improve customer service. As a customer's journey progresses, Amazon collects and analyses data at multiple stages: pre-purchase activity such as browsing products, entering search terms, reviews and page views; purchase data from past purchase histories; and post-purchase activities such as returns and service interactions (Ke, 2018).
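One common regression building block for dynamic pricing is a log-log model, whose slope estimates the price elasticity of demand. The sketch below fits such a model on simulated transactions and applies a standard markup heuristic; the data, the assumed marginal cost and the heuristic are illustrative assumptions rather than a description of any retailer's system.

```python
# Hedged sketch: estimating own-price elasticity with a log-log regression.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
prices = rng.uniform(25, 75, 300)
# Simulated demand with a true elasticity of about -1.8.
units = np.exp(6.0 - 1.8 * np.log(prices) + rng.normal(0, 0.15, 300))
data = pd.DataFrame({"price": prices, "units": units})

# ln(units) = a + b * ln(price); the slope b is the price elasticity.
X = np.log(data[["price"]])
y = np.log(data["units"])
elasticity = LinearRegression().fit(X, y).coef_[0]
print(f"Estimated price elasticity: {elasticity:.2f}")

# A standard markup heuristic: with constant marginal cost c, the
# profit-maximizing price is c * e / (1 + e) when e < -1 (elastic demand).
c = 20.0
optimal_price = c * elasticity / (1 + elasticity)
print(f"Suggested price at cost {c}: {optimal_price:.2f}")
```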
7.3 Superior Reporting through Automated Data Visualization
Artificial intelligence is much faster and more capable than human experts at transforming data into visual insights. Visualizations are normally generated manually by analysts using tools such as Excel or Tableau, but with automatic enterprise analytics solutions such as Qlik, analysts can gather data sources and generate useful dashboards and reports for the marketing department. Increasingly, platforms are using advanced machine learning algorithms and data analytics to uncover market trends and behavioural patterns which can be typically hidden from plain sight and difficult to discern (Mero et al., 2020). 7.4
User Insight and Personalization Using Text Classification
A natural language processing (NLP) system is able to analyse text- or voice-based content, and then classify it according to variables such as tone, sentiment, or topic to generate consumer insights or customize content to meet the interests of the audience. Text, whether from product reviews, user-generated social media or marketing initiatives by firms, provides real-time data that can shed light on consumer preferences. This can serve as an alternative to traditional marketing research tools (Berger et al., 2019). Tone Analyser by IBM Watson can be used to evaluate online customer feedback and determine what users are saying generally about a product (Packowski and Lakhana, 2017).
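A minimal version of such text classification can be built with a TF-IDF bag-of-words model, as sketched below on a handful of invented fashion reviews; a production system would rely on a properly labelled corpus and, increasingly, on pretrained language models rather than this toy setup.

```python
# Toy sentiment classifier for product reviews (invented training examples).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

reviews = [
    "love the fit and the fabric quality",
    "terrible stitching, returned it the next day",
    "colour exactly as shown, very happy",
    "arrived late and the size runs small",
]
labels = ["positive", "negative", "positive", "negative"]

# TF-IDF features over unigrams and bigrams, fed to a linear classifier.
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(reviews, labels)

print(clf.predict(["the jacket looks great but delivery was slow"]))
```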
7.5 Customer Experience Automation Using Chatbots
Chatbots are among the most popular and widely used applications of artificial intelligence, yet the majority of them are completely scripted, without any natural language processing or machine learning. The more sophisticated the dialogue system, the more it can access external knowledge bases, accommodate unusual questions and escalate to human agents when necessary. Several companies have already implemented chatbots to engage consumers across their lifecycle, from the moment they learn about a brand to after they make a purchase and need customer service (Hoyer et al., 2020).
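To show what "completely scripted" means in practice, here is a deliberately simple keyword-matched bot with a human-escalation fallback; the intents and canned answers are invented for the example.

```python
# Toy illustration of a fully scripted chatbot: keyword-matched intents,
# no NLP or learning, with escalation to a human agent as the fallback.
RESPONSES = {
    "delivery": "Orders are usually delivered within 3-5 working days.",
    "return": "You can return items within 30 days from the 'My Orders' page.",
    "size": "Please check the size chart on each product page.",
}

def reply(message: str) -> str:
    text = message.lower()
    for keyword, answer in RESPONSES.items():
        if keyword in text:
            return answer
    # Unrecognized questions are escalated to a human agent.
    return "Let me connect you with a customer service agent."

print(reply("How long does delivery take?"))
print(reply("Can I pay with gift vouchers?"))
```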
7.6 Branded Object Recognition Using Computer Vision
The field of computer vision is rapidly advancing in artificial intelligence, thereby creating many possibilities for businesses and marketing areas. ML-powered computer vision enables marketers to extract insights from images and videos that are unlabelled (Salminen et al., 2019). This enables marketers to identify when their brand logos appear in user-generated content and to determine earned media through video analysis. Tech-savvy marketers can integrate APIs like Clarifai into their websites to build custom solutions for content moderation, search engines, and recommendation engines based on visual similarities (Nanne et al., 2020). 7.7
Automation of Marketing Operations using Robotic Process Automation
The complexity of the business environment has been beyond the intuitive comprehension and manual capabilities of human analysts. While it is feasible for humans to determine targeting strategies for many segments, automation is required for hundreds of clearly delineated microsegments. Real-time replies are also required for frequent contact. When mobile monitoring identifies an inbound customer, for example, a promotional offer must be delivered within minutes. Marketing is increasingly relying on automation and real-time optimization (Miklosik et al., 2019). The go-to solutions are deep learning and machine learning. Automations made for digital marketers are aimed at simplifying work and making work more efficient. They read emails, open and analyse attachments, create templated reports to track social media engagement, and enter data automatically. The AI platform Albert for online ads reduces human involvement in large-scale media buying by installing algorithms that can quickly calculate required data and optimize paid advertising campaigns (Campbell et al., 2020).
8. ETHICAL IMPLICATIONS
As machine learning and big data are rapidly advancing, there has been an increase in algorithm-driven, biased decisions, for example, for sorting applications for job interviews, evaluating mortgage applications, and offering credit products. Many studies have shown that biased algorithms have a disparate impact on certain groups in society due to their racial, ethnic and socioeconomic differences (Akter et al., 2021). Consumer privacy has become more important as a result of the big data revolution. Digitalization, and the movement of
information across organizational borders, will increase privacy, security and intellectual property concerns (Dekimpe, 2020). According to Guha et al. (2021), the ethics-related issues of using AI in retailing can be categorized into three areas: privacy, bias and appropriateness.

In terms of data privacy, there are two aspects to consider: the identifiability and the sensitivity of data (Altman et al., 2018). Some customer data, such as credit card numbers, are directly identifying, and there are situations in which customer characteristics such as age, gender and zip code can be combined to identify a single individual. The sensitivity of the data also matters, whether it involves political views, religious beliefs or private transactions (certain medicines). AI applications will increase both the identifiability and the sensitivity of customer data. As such, customers' privacy concerns may be heightened if retailers use artificial intelligence to categorize their purchases based on personal characteristics.

According to De Bruyn et al. (2020), bias is a serious issue in retailing. During the learning process, AI may infer certain biases from historical data sets. For example, race and gender may not be explicitly included in an AI algorithm, but they may be inferred from data such as user preferences obtained from tracking views, clicks and scrolls, and then used to increase the price of products for particular demographics (De Bruyn et al., 2020). The Apple credit card, for instance, offered women lower credit limits than men even though gender was not used in the algorithm (although other characteristics used might be correlated with gender), and no one from Apple could explain why. This was widely publicized in the news, resulting in a damaging public relations situation (Guha et al., 2021).

Furthermore, there are concerns regarding appropriateness. These relate to AI applications that are completely legal but may cause controversy and unfavourable public perception. Using AI, retailers can, for example, infer whether customers are likely to be divorced or suffering from mental health concerns based on overt indicators such as facial traits, names and previous purchases (Guha et al., 2021); after a divorce, for instance, a customer might acquire new consumption habits, require additional items and services, or display new brand preferences. Some would argue that this sort of inferred data is so private that businesses should not draw any conclusions from it.
9. CONCLUSION

This research sought to gain insights into how to generate value from data obtained from different sources in the fashion retailing industry and how big data analytics can assist organizations in gaining insight from data and predicting customer behaviour. Technology, big data and competitiveness will accelerate the development of AI applications enabled by machine learning techniques in every facet of business and marketing in the coming decades. It is critical for the fashion industry to use valuable digital data to gain a better understanding of businesses and customers, to address emerging substantive concerns in the field, and to build scalable and automated decision-making capabilities that will become indispensable to corporate executives. From all of these viewpoints, big data analytics and machine learning technologies offer a number of promising ways to address key challenges in the retail and fashion industries. It is clear that incorporating big data analytics and machine learning approaches effectively for marketing objectives will be challenging for businesses; however, the advantages of applying these techniques outweigh the challenges.
REFERENCES Acharya, A., Singh, S.K., Pereira, V. and Singh, P., 2018. Big data, knowledge co-creation and decision making in fashion industry. International Journal of Information Management, Volume 42, pp. 90–101. Adnan, K. and Akbar, R., 2019. An analytical study of information extraction from unstructured and multidimensional Big Data. Journal of Big Data, 6(1). Ahmad, A., Jafar, A. and Aljoumaa, K., 2019. Customer churn prediction in telecom using machine learning in big data platform. Journal of Big Data, 6(2). Akter, S., Wamba, S.F., Gunasekaran, A., Dubey, R. and Childe, S.J. 2016. How to improve firm performance using big data analytics capability and business strategy alignment? International Journal of Production Economics, 182, 113–31. Akter, S., Dwivedi, Y.K., Biswas, K. Michael, K., Bandara, R.J. and Sajib, S. 2021. Addressing algorithmic bias in AI-driven customer management. Journal of Global Information Management, 29(6). Allenby, G.M., Bradlow, E.T., George, E.I., Liechty, J. and McCulloch, R.E. 2014. Perspectives on Bayesian methods and big data, Customer Needs and Solutions, 1(3), 169–75. Altman, M., Wood, A., O’Brien, D.R. and Gasser, U. 2018. Practical approaches to big data privacy over time. International Data Privacy Law, 8(1), 29–51. Aluri, A., Price, B.S. and McIntyre, N.H. 2019. Using machine learning to cocreate value through dynamic customer engagement in a brand loyalty program. Journal of Hospitality & Tourism Research, 43(1), 78–100. Atkinson, L. and Cipriani, A. 2018. How to carry out a literature search for a systematic review: A practical guide. BJPsych Advances, 24(2), 74–82. Berger, J., Humphreys, A. and Schweidel, D.A. 2019. Uniting the tribes: Using text for marketing insight. Journal of Marketing, pp. 1–25. Bramer, W.M., De Jonge, G.B., Rethlefsen, M.L., Mast, F. and Kleijnen, J. 2018. A systematic approach to searching: An efficient and complete method to develop literature searches. Journal of the Medical Library Association, 106(4), 531–41. Campbell, C., Sands, S., Ferraro, C., Tsao, H.-Y. and Mavrommatis, A. 2020. From data to action: How marketers can leverage AI. Business Horizons, 63(2), 247–3. De Bruyn, A., Viswanathan, V., Beh, Y.S., Brock, J.K.U. and von Wangenheim, F. 2020. Artificial intelligence and marketing: Pitfalls and opportunities. Journal of Interactive Marketing, 51, 91–105. Dekimpe, M.G. 2020. Retailing and retailing research in the age of big data analytics. International Journal of Research in Marketing, 37(1), 3–14. Deshpande, P.S., Sharma, S.C. and Peddoju, S.K. 2019. Predictive and prescriptive analytics in big-data era. Security and Data Storage Aspect in Cloud Computing, pp. 71–81. Dorschel, J. 2015. Praxishandbuch Big Data. Karlsruhe: Springer Gabler. Duan, L. and Xiong, Y. 2015. Big data analytics and business analytics. Journal of Management Analytics, 2(1), 1–21. Evans, J.R. and Linder, C.H. 2012. Business Analytics: The Next Frontier for Decision Sciences. University of Cincinnati: Decision Science Institute. Faroukhi, A.Z., El Alaoui, I., Gahi, Y. and Amine, A. 2020. Big data monetization throughout Big Data Value Chain: A comprehensive review. Journal of Big Data, 7(1). Gandomi, A. and Haider, M. 2015. Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information Management, 35(2), 137–44. García, S., Ramírez-Gallego, S., Luengo, J., Benítez, J.M. and Herrera, F. 2016. Big data preprocessing: Methods and prospects. Big Data Analytics, 1(1). Goodfellow, I., Bengio, Y. 
and Courville, A., 2016. Deep Learning. Cambridge, MA: MIT Press. Griva, A., Bardaki, C., Pramatari, K. and Papakiriakopoulos, D. 2018. Retail business analytics: Customer visit segmentation using market basket data. Expert Systems with Applications, 100, 1–16. Grover, V., Chiang, R.H.L., Liang, T.-P. and Zhang, D., 2018. Creating strategic business value from big data analytics: A research framework. Journal of Management Information Systems, 35(2), 388–423. Guha, A., Grewal, D., Kopalle, P.K., Haenlein, M., Schneider, M.J., Jung, H., Moustafa, R. et al. 2021. How artificial intelligence will affect the future of retailing. Journal of Retailing, 97(1), 28–41.
84 Handbook of big data research methods Guo, Z.X., Wong, W.K., Leung, S.Y.S. and Li, M. 2011. Applications of artificial intelligence in the apparel industry: A review. Textile Research Journal, 81(18), 1871–92. Gu, X., Gao, F., Tan, M. and Peng, P. 2020. Fashion analysis and understanding with artificial intelligence. Information Processing & Management, 57(5). Hartmann, J., Huppertz, J., Schamp, C. and Heitmann, M. 2019. Comparing automated text classification methods. International Journal of Research in Marketing, 36, 20–38. Hofmann, E. and Rutschmann, E. 2018. Big data analytics and demand forecasting in supply chains: A conceptual analysis. The International Journal of Logistics Management, 29(2), 739–66. Hoyer, W.D., Kroschke, M., Schmitt, B. and Kraume, K. 2020. Transforming the customer experience through new technologies. Journal of Interactive Marketing, 51, 57–71. Huang, D. and Luo, L. 2016. Consumer preference elicitation of complex products using fuzzy support vector machine active learning. Marketing Science, 35(3), 445–64. Israeli, A. and Avery, J. 2018. Predicting consumer tastes with big data at Gap. Harvard Business Review. Janssen, M., van der Voort, H. and Wahyudi, A. 2017. Factors influencing big data decision-making quality. Journal of Business Research, 70, 338–45. Jelil, R.A. 2018. Review of artificial intelligence applications in garment manufacturing. In: S. Thomassey and X. Zeng (eds), Artificial Intelligence for Fashion Industry in the Big Data Era. Singapore: Springer, pp. 97–123. Jun, S., Park, S.-S. and Jang, D.-S. 2014. Document clustering method using dimension reduction and support vector clustering to overcome sparseness. Expert Systems with Applications, 41(7), 3204–12. Kambatla, K., Kollias, G., Kumar, V. and Grama, A. 2014. Trends in big data analytics. Journal of Parallel and Distributed Computing, 74(7), 2561–73. Ke, W. 2018. Power pricing in the age of AI and analytics. Online. Available at: https://www.forbes .com/sites/forbesfinancecouncil/2018/11/02/power-pricing-in-the-age-of-ai-and-analytics/?sh= 432ab64d784a. Accessed 1 December 2021. Kraus, M., Feuerriegel, S. and Oztekin, A. 2019. Deep learning in business analytics and operations research: Models, applications and managerial implications. European Journal of Operational Research. Lepenioti, K., Bousdekis, A., Apostolou, D. and Mentzas, G., 2020. Prescriptive analytics: Literature review and research challenges. International Journal of Information Management, 50, p. 57–70. Liu, J. and Toubia, O., 2018. A semantic approach for estimating consumer content preferences from online search queries. Marketing Science, 37(6), 855–82. Liu, X., Iftikhar, N. and Xie, X. 2014. Survey of real-time processing systems for big data. Proceedings of the 18th International Database Engineering & Applications Symposium, pp. 356–61. Ma, L. and Sun, B. 2020. Machine learning and AI in marketing – Connecting computing power to human insights. International Journal of Research in Marketing. McDowell, M. 2019. Stores get smart about AI. Online. Available at: https://www.voguebusiness.com/ technology/artificial-intelligence-physical-stores-kering-nike-alibaba. Accessed 1 December 2021. Mero, J., Tarkiainen, A. and Tobon, J. 2020. Effectual and causal reasoning in the adoption of marketing automation. Industrial Marketing Management, 86, 212–22. Miklosik, A., Kuchta, M., Evans, N. and Zak, S. 2019. Towards the adoption of machine learning-based analytical tools in digital marketing. IEEE Access, 7, 85705–718. 
Miralles-Pechuán, L., Ponce, H. and Martínez-Villaseñor, L. 2018. A novel methodology for optimizing display advertising campaigns using genetic algorithms. Electronic Commerce Research and Applications, 27, 39–51. Morgan, S. 2020. The world will store 200 zettabytes of data by 2025. Online. Available at: https://c ybersecurityventures.com/the-world-will-store-200-zettabytes-of-data-by-2025/. Accessed 11 June 2021. Nanne, A.J., Antheunis, M.L., van Noort, G., van der Lee, C.G., Postma, E.O and Wubben, S. 2020. The use of computer vision to analyze brand-related user generated image content. Journal of Interactive Marketing. Packowski, S. and Lakhana, A., 2017. Using IBM Watson cloud services to build natural language processing solutions to leverage chat tools. Proceedings of the 27th Annual International Conference on Computer Science and Software Engineering.
How big data analytics will transform the future of fashion retailing 85 Ramírez-Gallego, S., García, S., Mouriño-Talin, H., Martínez-Rego, D., Bolón-Canedo, V., Alonso-Betanzos, A. and Benítez, J.M. 2015. Data discretization: Taxonomy and big data challenge. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 6(1), 5–21. Rehman, M., Liew, C.S., Abbas, A., Jayaraman, P.P., Wah, T.Y. and Khan, S.U. 2016. Big data reduction methods: A survey. Data Science Engineering, 1(4), 265–84. Rehman, M.H., Chang, V., Batool, A. and Wah, T. 2016. Big data reduction framework for value creation in sustainable enterprises. International Journal of Information Management, 36(6), 917–28. Ren, S., Hui, C.L.P. and Choi, T.M.J. 2018. AI-based fashion sales forecasting methods in big data era. In: S. Thomassey and X. Zeng (eds), Artificial Intelligence for Fashion Industry in the Big Data Era. Singapore: Springer, pp. 9–26. Saggi, M.K. and Jain, S., 2018. A survey towards an integration of big data analytics to big insights for value-creation. Information Processing & Management, 54(5), 758–90. Salminen, J., Yoganathan, V., Corporan, J., Jansen, B.J. and Jung, S.G. 2019. Machine learning approach to auto-tagging online content for content marketing efficiency: A comparative analysis between methods and content type. Journal of Business Research, 01, 203–217. Shankar, V., 2019. Big data and analytics in retailing. NIM Marketing Intelligence Review, 11(1), 36–40. Silva, E.S., Hassani, H. and Madsen, D.Ø. 2019. Big Data in fashion: Transforming the retail sector. Journal of Business Strategy, 41(4), 21–7. Storey, V. and Song, I.-Y. 2017. Big data technologies and management: What conceptual modeling can do. Data Knowledge Engineering, 108, 50–67. Sung, E. 2021. The effects of augmented reality mobile app advertising: Viral marketing via shared social experience. Journal of Business Research, 22, 75–87. Tan, K.H., Zhan, Y.Z., Ji, G., Ye, F. and Chang, C. 2015. Harvesting big data to enhance supply chain innovation capabilities: An analytic infrastructure based on deduction graph. International Journal of Production Economics, 165, 223–33. Tavakoli, M., Molavi, M., Masoumi, V., Mobini, M., Etemad, S. and Rahmani, R. 2018. Customer segmentation and strategy development based on user behavior analysis, RFM model and data mining techniques: A case study. IEEE 15th International Conference on e-Business Engineering (ICEBE), pp. 119–26. Thomassey, S. and Zeng, X. 2018. Artificial Intelligence for Fashion Industry in the Big Data Era. Singapore: Springer. Wang, L. 2017. Heterogeneous data and big data analytics. Automatic Control and Information Sciences, 3(1), 8–15. Wong, E. and Wei, Y. 2018. Customer online shopping experience data analytics: Integrated customer segmentation and customised services prediction model. International Journal of Retail & Distribution Management, 46(4), 406–20. Xia, M., Zhang, Y., Weng, L. and Ye, X. 2012. Fashion retailing forecasting based on extreme learning machine with adaptive metrics of inputs. Knowledge-Based Systems, 36, 253–9. Xue, Z., Zeng, X. and Koehl, L., 2018. Artificial intelligence applied to multisensory studies of textile products. In: S. Thomassey and X. Zeng (eds), Artificial Intelligence for Fashion Industry in the Big Data Era. Singapore: Springer, pp. 211–44.
6. Descriptive analytics and data visualization in e-commerce
P.S. Varsha and Anjan Karan
INTRODUCTION Emerging technologies and the explosion of big data have occurred in recent years in both academia and industry (Akter and Wamba, 2016). Companies deployed big data analytics (using the 7Vs) to disseminate the information in order to increase their value chain efficiency or productivity by 5–6 percent relative to their competitors (Wamba et al., 2015). Hence, the data can be generated in the form of structured, semi-structured and unstructured data (Ghofrani et al., 2018). Also, big data is gaining momentum and reshaping our views in today’s fastest-growing era (Chen et al., 2012). In the present scenario industries in various sectors are moving towards big data to help firms to achieve competitive advantage (Zhang et al., 2015). Subsequently, there has been increased interest in analytics such as statistical models, artificial intelligence (AI), data mining and machine learning (ML), which are getting more popular due to the massive data explosion that encourages data-driven decisions (Delen and Ram, 2018). From Google trends, analytics have been gaining more attention over the last ten years across the globe (Yin and Fernandez, 2020). For instance, Walmart’s agreement with HP (Hewlett-Packard) to develop a data warehouse to monitor and record all transactions of 6000 stores across the globe. By deploying ML, Walmart was able to detect patterns in these databases which allowed them to analyze pricing strategies and advertising campaigns and to optimize the supply chain (Marzouk and Enaba, 2019). This illustration emphasizes the collection of tools and technologies such as databases, data warehouses, customer relationship management (CRM), data visualization, enterprise resource planning (ERP), supply chain management (SCM) and web applications to provide meaningful insights to the data in the firms (Chaudhuri et al., 2011; Davenport and Harris, 2007; Shen and Tzeng, 2016). Hence, integrated applications and big data help the managers, analysts and executives to make the right decisions for their firms (Shanks and Bekmamedova, 2012). Furthermore, the data will be collected from various sources like Twitter, Facebook, YouTube, blogs, wikis, emails and mobile applications, and their data forms such as numerical, audio, video and unstructured forms (Raghupathi and Raghupathi, 2021). These huge amounts of data are collected while stakeholders use these big data to advance their business decisions and improve organization performance (Sharma et al., 2014). This is where analytics plays an important role in transforming the information into knowledge for constructive decision-making (Martinez-Martinez et al., 2018; Martinez-Martinez et al., 2019; Raghu and Vinze, 2007). Hence analytics becomes a core component of a firm’s process by using tools and techniques to solve the problems (Vidgen et al., 2017). Analytics is the short abstract of business analytics (Gorman and Klimberg, 2014; Holsapple et al. 2014). It is also referred to as data analytics (Shen and Tzeng, 2016). Business analytics refers to the application of models, techniques and tools used for data analysis to provide meaningful decisions (Jalali and 86
Descriptive analytics and data visualization in e-commerce 87 Park, 2018). It is also a process that involves the usage of statistical techniques, information software and operation research methodologies to inspect, visualize, invent and communicate data patterns (Sedkaoui, 2018). Implementation of business analytics in any organization helps to improve decision-making by creating models, metrics and dashboards from real-time data (Sahu et al., 2017). Models include regression, ML and data mining (clustering), AI (search, image and voice recognition). Methods comprise visualization, numerical outputs, and so on. Tools include business intelligence analytic tools such as Tableau, Tibco and Alteryx, and programming languages consist of R and Python (Raghupathi and Raghupathi, 2021). Lastly, to conclude, the above definitions of business analytics are data-driven, fact-based decision-making in the firms to provide lots of choice. Business analytics are categorized into four types (Raghupathi and Raghupathi, 2021; Wang et al., 2016): Descriptive (what happened); Predictive (what will happen next); Prescriptive (what can be done); and Discovery/Wisdom (how it can be discovered). Descriptive analytics enables knowing what happened in the past and what will happen in present business situations (Yin and Fernandez, 2020). Descriptive analytics is quite simple to deploy for data analysis without any complex calculations (Raghupathi and Raghupathi, 2013). Predictive analytics mainly focuses on forecasting the future scenario and it is a more advanced type of business analytics that explains the usage of information vs. data by providing choices (Yin and Fernandez, 2020; Raghupathi and Raghupathi, 2013). Consider an example of a finance executive able to predict the shares which include financial products and pricing offerings of various investments from the clients and also anticipate the risks and challenges through statistical modeling and data mining (Raghupathi and Raghupathi, 2021). Prescriptive analytics is involved in the decision-making process to bring actionable solutions and suggestions for specific business problems (Yin and Fernandez, 2020). Prescriptive analytics is used in business knowledge to bring optimal results for specific business problems in the organization and provides recommendations for each problem (Raghupathi and Raghupathi, 2013). Consider an example of prescriptive analytics in various customer relationship management to increase customer satisfaction (Raghupathi and Raghupathi, 2021). Lastly, discovery analytics uses the knowledge about knowledge to recognize new products and services (Raghupathi and Raghupathi, 2013). A prominent example of this is the pharmaceutical industry, which, while introducing a new drug to the market, employed computer simulation and what-if analysis to provide information for firms to introduce the new medicines that are available in the market (Raghupathi and Raghupathi, 2021). Lastly, deploying all types of analytics can help companies to make smart decisions and increase business value (Hindle and Vidgen, 2018). But still, it is challenging when using data collected from several resources to reach logical conclusions deploying business intelligence effectively for interpretation and reflection (Davenport et al., 1998). There will be lots of issues in firms which are dissatisfied with the digitization of budgeting processes (Bergmann et al., 2020). 
Also, marketers have growing concerns about understanding the big data in firms (Fleming et al., 2018) and also face difficulties in knowing the data culture which includes analysis and decision making (Johnson et al., 2019; Goran et al., 2017). Further, there is not much clarity among the marketers about descriptive analytics on how to clarify metrics into actionable insights (Berman and Israeli, 2021). However, many marketing professionals and researchers are still in the ideation phase in deploying and understanding the benefits of descriptive analytics in the e-commerce industry. Thus, this study fills the void by providing a solution and benchmark regarding the benefits of using descriptive analytics and dashboards for decision making in firms. Furthermore, we develop
an e-commerce case to explore the potential benefits of descriptive analytics through data visualization. Several marketing practitioners use descriptive analytics to develop dashboards (Mintz et al., 2019). Descriptive analytics is very simple, calculating basic statistics and offering a distinctive way of visualizing market- and business-related problems to analyze historical data (Kumar, 2017). Thus, our study explores the importance and value creation of descriptive analytics in the e-commerce industry. Overall, the study answers the following research questions:

R1: How is descriptive analytics most valuable to firms?
R2: What is data visualization and how does it help marketers to take strategic decisions in the e-commerce industry?

To answer these research questions, this study describes descriptive analytics, data visualization and their applications through a literature review. Further, the study discusses the research approach and uses a case study to explain how descriptive analytics and data visualization can help the e-commerce industry. Lastly, the findings, implications, challenges and future research directions will guide scholars and managers in how to improve performance in the e-commerce industry.
R1: How is descriptive analytics most valuable to firms?
The basics of marketing will remain the same despite changing forces that will shape the future of marketing (Grandhi et al., 2020). Customers can resist standardization due to technological advancements which bring more customization/ personalization, resulting in a change in consumer behavior (Kotler and Keller, 2012). Some of the firms deploy descriptive analytics to gain information from the past to aid decision making and other firms use a combination of analytics to make decisions (Sedkaoui, 2018). Furthermore, descriptive analytics provides a statistical perspective by taking past data using algorithms to execute clustering and categorization to explain data. Also, these analytics implement supervised, semi-supervised and unsupervised learning models (Rehman et al., 2016). Supervised machine learning models assume the function from labeled data which is allocated by the researchers, whereas unsupervised learning allows the system to find its structure without any human intervention (Mamo et al., 2021). The main functions of descriptive analytics are text mining, clustering models, ratio analysis and data visualization, which will be used to improve the operational performance and decision making in firms (Balducci and Marinova, 2018). For instance, Chang’s (2019) research showed that fans’ emotions exhibited in the tweets changed throughout a Super Bowl 50 game depending on the location and time interval. This showcases that sports marketing uses social media data to make a significant impact on descriptive analysis and interpretations to examine models and relationships between variables to measure the emotional transition of fans. Subsequently, to optimize store performance, retail atmospherics uses the in-store analytical platform through cognitive computing (Behera et al., 2021). Further, this in-store analytical platform uses descriptive analytics using algorithms through text classification to evaluate store performance, shoppers’ satisfaction time and cognitive analytics able to process the text to assess purchase plans and emotional experience (Behera et al., 2021). Hence, the descriptive analytics benefited the stores in mapping the shopper journeys, enhancing their
Descriptive analytics and data visualization in e-commerce 89 customer relations, improving shopper segmentation, planning campaigns in social media and understanding the customers’ insight in the firms (Behera et al., 2021; Yerpude and Singhal, 2019). The most recent research indicates that entrepreneurs require analytical thinking and computational skills to handle big data and its challenges. As a result, the new generation of entrepreneurs can use descriptive analytics to analyze historical data to create reports, visualize, and understand the entrepreneurial directions roadmap by identifying the firm’s mission, vision and objectives; developing policies and guidelines; and evaluating and interpreting the data for meaningful decisions to provide new entrepreneurial opportunities (Sedkaoui, 2018). Deploying descriptive analytics of Twitter data helps marketers to understand user engagement in mobile apps by creating a summary of historical information, enabling them to make decisions by using statistical measures like tweet statistics, user statistics, URL analytics, hashtag analysis, conversation/mentions, word clouds, reach metrics (Aswani et al., 2018a, 2018b). Also, these analytics are used in the smart destination in the tourism industry by integrating the application programming interfaces, data reservoir and visualization by enhancing customer satisfaction, segmentation and pricing decisions (Zeng et al., 2020; Mariani et al., 2021). All these above studies discuss the importance of descriptive analytics, which brings value to the firms.
LITERATURE REVIEW

1. Descriptive Analytics
To understand the importance of data, descriptive analytics is referred to as data mining to provide insights on what’s happening with the data based on analysis of historical data (Sathiyanarayanan and Turkay, 2017). It is a technique used to explain and generate a report from past data. Retailers incorporate descriptive analytics to explain and summarize sales across different regions and inventory levels through data visualization, descriptive statistics and data mining techniques (Ridge et al., 2015; Davenport and Dyché, 2013). Also, it is referred to as a tool to convert unstructured data into meaningful data by using descriptive statistics in the fashion retail industry to analyze the frequency and mean average of date of purchase and amount by customizing customer profiles (Giri et al., 2019). To understand the outline of organization information, descriptive analytics summarizes and converts data into meaningful information to generate reports, metrics, dashboards and also provides solutions to certain questions such as: What has happened? What is presently happening? (Mortenson et al., 2015). Further, analytics helps analyze the current trends in the retail industry and provides the solution for what has happened and aims to look for predictive analytics (Bedeley et al., 2018). Subsequently, descriptive analysis is able to generate the dashboards and it can be used in day-to-day operations in the organization to make relevant decisions to achieve competitive advantage (Banerjee et al., 2013). Several studies revealed that descriptive analytics helps in the preparation of data for prescriptive analytics and consists of sums, averages, and percentage changes in sales by retail companies, total profit by product/distribution, number of customer complaints resolved, and the average amount spent per customer, which all support the company in competing for the major players (McCarthy et al., 2022). Then the reports are generated mostly on sales-related products, which creates a big impact on management who are making several decisions (Mathur, 2019). Descriptive analytics aims to examine the
90 Handbook of big data research methods present process to recognize current problems and opportunities (Marzouk and Enaba, 2019). Subsequently, to differentiate other analytics, descriptive analytics are more data-driven in business to outline the information, and results are displayed in charts or reports or in responses to questions using SQL to better understand trends and patterns (Hoyt et al., 2016). Furthermore, analytics helps answer specific questions such as: How many customers bought the products in the last five years? What was the return on investment for the previous quarter? What kind of products are to be sold in quantity? Which products will generate more profit? How about discounts for specific customers? (Raghupathi and Raghupathi, 2013). Nature of descriptive analytics (DA) Descriptive analytics tools are used to ascertain certain business outcomes and describe their results through graphs, and quantitative and numerical analysis. In addition to this, data visualization can be done through descriptive analytics and provides solutions for organization performance (Oswald et al., 2020). Many of the data are examined that have a hidden pattern, which can be narrated through graphical representation and numerical analysis (Sedkaoui, 2018). This is a simple tool to solve the problems quickly and results are interpreted and incorporated in strategic decision making in the organization (Sahay, 2018). Descriptive analytics helps give insight into the data and comprises types, sources of data, data preparation (data cleansing, transformation and modeling), to differentiate structured and unstructured data and also its quality (Raghupathi and Raghupathi, 2018). Data visualization through graphs by using various types of software is the general requirement for descriptive analytics (Brynjolfsson and McElheran, 2016). The visual representation is newly developed through bullet graphs, treemaps and dashboards (Müller et al., 2018). Nowadays, dashboards are becoming more popular with big data. They are used to exhibit the multiple views and interpretations of data graphically (Dilla et al., 2010). In addition, the nature of descriptive analytics is to find out the simple numerical calculations which include measuring the central tendency, measures of position, measures of variation, and measures of shapes by using various statistics to draw the results for decision making in the organization (Kumar, 2017). From the empirical perspective, descriptive analytics helps establish the relationship between the two variables, which are the covariance and correlation coefficient (Kumar, 2017). Subsequently, the tools of descriptive analytics are numerical analyses (mean, median, mode, frequency, and so on) and graphical analyses (bar chart, scatter chart, pie chart, coxcomb chart, tree graph, bullet graph, histogram, box plot), which help to identify the trend or patterns of data, thereby helping to improve organizational capabilities (Kumar, 2017). For descriptive analytical techniques to be effective in the firms, marketers must have knowledge of customer database information, sales data, web search data, and how to address customer queries (Sahay, 2018). Types of descriptive analytics Descriptive analytics utilizes the raw data through data aggregation/ data mining to provide valuable insights from historical data. Descriptive analytics can be classified based on unsupervised learning and business statistics (Abdullah et al., 2017; Rawat, 2021). 
Based on unsupervised learning

Association rules: a method to identify frequently occurring patterns between the variables in the given data sets (Abdullah et al., 2017). Examples: detecting which types of brands customers purchase most frequently in e-commerce; revealing which elective courses are most frequently selected by students in institutions.

Sequence rules: an approach that distinguishes the sequence of events occurring in the given data (Abdullah et al., 2017). Examples: recognizing the series of purchase patterns of hypermarket and supermarket customers for groceries; revealing customers' sequences of keyword searches and website visits.

Clustering/segmenting: a mechanism that identifies the similarities and dissimilarities of data sets and is able to structure items without labeled data (Abdullah et al., 2017). Example: comparison between brands in global markets in terms of segmentation, targeting and positioning of product data sets.

Based on business statistics

Measures of frequency: counting the number of times each value of a variable occurs in the sample (Rawat, 2021). Example: the number of male and female customers purchasing Nike products on the Amazon website.

Measures of central tendency: finding the middle or average of a data set, most commonly the mean, median or mode (Rawat, 2021). Examples: grouping data on people subscribing to Netflix, budgeting, comparing salaries across different businesses.

Measures of variation: measuring the variability of the data, that is, how far the data points fall from the center (Rawat, 2021). Example: whether women respond more strongly than men to coupons or offers when purchasing lifestyle products.

Measures of position: recognizing whether certain data points/values in the given sample distribution are average, high or low (Rawat, 2021). Example: percentiles/quartiles of Flipkart sales from the past year. A short sketch of how these business-statistics measures can be computed follows below.
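The sketch below computes the business-statistics measures just listed with pandas on an invented e-commerce orders table; the column names and distributions are assumptions made purely for illustration.

```python
# Hedged sketch of the business-statistics measures on invented order data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
orders = pd.DataFrame({
    "gender": rng.choice(["female", "male"], 1000),
    "order_value": rng.gamma(2.0, 30.0, 1000),
})

# Measures of frequency: how often each category occurs.
print(orders["gender"].value_counts())

# Measures of central tendency: mean, median and mode of order value.
print(orders["order_value"].agg(["mean", "median"]))
print(orders["order_value"].round().mode())

# Measures of variation: how widely values spread around the centre.
print(orders["order_value"].agg(["std", "var", "min", "max"]))

# Measures of position: percentiles/quartiles of order value.
print(orders["order_value"].quantile([0.25, 0.5, 0.75, 0.9]))
```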
2. Data Visualization
Visualization is most relevant for descriptive analytics which requires visualizing the data with several variables, dimensions, correlations, and so on (Raghupathi and Raghupathi, 2021). It is also referred to as storytelling through ‘science’ (Raghupathi and Raghupathi, 2021). For instance, a picture says more than words can (Raghupathi and Raghupathi, 2013). Several charts can explain the data features and narrate the information (Raghupathi and Raghupathi, 2018). Furthermore, overall charts tell a story about data, and identify and analyze the problems to bring results for decision making and their implications by using tools such as business intelligence (Dill et al., 2012). Appelbaum et al. (2017) in their study suggested that financial reports could be summarized through data visualization using descriptive analytics to understand the financial position of the firm. A recent study revealed that data visualization is used to examine K12 history lessons from eight well-known online courses to evaluate students’ understanding of topics, enabling teachers to improve their classes or revise their lessons (Finholm and Shreiner, 2022). Hence, data visualization is defined as the collection of processes in which data are represented through graphs and interpret a specific goal to gain meaningful information and knowledge (Caughlin and Bauer, 2019). Also, data is displayed through visual aesthetics such as position, length, area and color, to gain the attention of a specific audience (Sinar, 2018, 2015). Hence, analytics is the science used to examine the data to make decisions for specific firms’ goals and objectives through visualizations such as charts, scatter plots, or dashboards (Sharma, 2020). Furthermore, dashboards with multiple
92 Handbook of big data research methods charts help managers to gain information on sales, profits, market share, return on investment, and so on, to keep them updated on current market scenarios (Sharma, 2020). The leading software tools available in the visualization market are Tableau and QlikView, for example (Sharma, 2020). Several past research works have revealed that various domains understand the importance and purpose of visualization for decision making, which include neuroscience (Allen et al., 2012), cognitive science (Shah and Freedman, 2011), health care (Allwood et al., 2013), financial accounting (Anderson and Muller, 2011), marketing and advertising (Ashman and Patterson, 2015), computer science (Borkin et al., 2015), information systems (Benbasat and Dexter, 1986), decision sciences (MacKay and Villarreal, 1987), psychology (Carpenter and Shah, 1998), education (Eilam and Poyas, 2010), human factors and ergonomics (Ali and Peebles, 2013), journalism (Kelly, 1993), operations (Eick and Wills, 1995), sociology (Lewandowsky and Spence, 1989–1990), statistics (Croxton and Stein, 1932), general business (Vila and Gomez, 2016), and general science (Cleveland et al., 1982). Ultimately data visualizations are considered as decision support tools for various firms’ growth perspectives (Caughlin and Bauer, 2019). 3.
Applications of Descriptive Analytics and Data Visualization
In this study we have highlighted the applications of descriptive analytics through data visualization in several domains to reach specific goals/ objectives. The study examined the objective of descriptive analytics in the education sector by analyzing the historical data of student applicants for course enrolment in post-graduation (Kim and Ahn, 2016). By using this technique educational institutions were able to determine and examine the data in learning management systems, to know the frequency of logins, the number of web pages viewed, monitor student data on complete courses, attendance and so on (Kim and Ahn, 2016). Also, in organizations, descriptive analytics provides a better understanding of firms’ performance factors such as revenues/ profits, and shareholder values through financial performance (Bedeley et al., 2018). Subsequently in technological development specifically, data visualization heat maps can identify problems in the organization. Furthermore, ad-hoc customer queries and search-based keywords in business intelligence (BI) can be calculated using this technique in inbound logistics (Gifford, 2013). Additionally, this descriptive analytics can be used to improve performance through interactive data visualization (dashboards) and transformation of goods in supply chain management (Chae et al., 2014). Besides these, descriptive analytics can be used in marketing and sales for data mining to recognize customer patterns including analyzing data on frequently brought products in e-commerce websites by using association of rule mining methods (Bedeley et al., 2018). Furthermore, it helps to understand consumer heterogeneity through neural networks in social media (Hayashi et al., 2010). In the service sector, descriptive analytics uses speech analytics in customer care centers to gather information on customer concerns and to improve customer service based on geospatial data (Ghoshal et al., 2018). Likewise, web-based unstructured data in e-commerce includes the information that can be retrieved and extracted through web intelligence, social media analytics and social network analysis (He et al., 2016). In HR analytics, it will be incorporated to develop dashboards, key performance indicators (KPIs) and metrics for employee performance, recruitment, training and development (Bedeley et al., 2018; Watson, 2009). This mechanism is able to examine and keep track of customer preferences and enhances customer loyalty (Morgan, 2019).
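As a lightweight illustration of the kind of chart-based reporting described above, the following sketch draws a sales-by-region bar chart and an order-value box plot with matplotlib. The regions and figures are invented, and a production dashboard would more typically be built in a BI tool such as Tableau or Qlik than in code.

```python
# Illustrative descriptive reporting: bar chart and box plot (invented data).
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
regions = ["North", "South", "East", "West"]
monthly_sales = [420, 310, 500, 275]                    # units, hypothetical
order_values = [rng.gamma(2.0, 25.0, 200) for _ in regions]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

ax1.bar(regions, monthly_sales, color="gray")
ax1.set_title("Monthly sales by region")
ax1.set_ylabel("Units sold")

ax2.boxplot(order_values)
ax2.set_xticks(range(1, len(regions) + 1))
ax2.set_xticklabels(regions)
ax2.set_title("Order value distribution by region")
ax2.set_ylabel("Order value")

fig.tight_layout()
fig.savefig("descriptive_dashboard.png")
```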
RESEARCH APPROACH

The study was conducted alongside a systematic literature review to synthesize present knowledge on the definitional aspects of descriptive analytics, including its types, nature and data visualization across several domains. First, the review process focused on answering the first research question: How is descriptive analytics most valuable to firms? This question led to the identification of the relevant subject areas, and the illustrations bring significant insight for experts and academicians. Second, the study extends its contribution by addressing the second research question: What is data visualization and how does it help marketers to take strategic decisions in the e-commerce industry? This question paves the way to defining data visualization and proposing a conceptual model, and a case study was developed on the Indian e-commerce company Flipkart which discusses descriptive analytics through data visualization for strategic decision making.

The research methodology proceeded in three phases. The first phase is data collection: data is collected through a systematic, structured questionnaire and imported into Tableau software from an Excel sheet. In the second phase, data analysis is conducted with descriptive analytics and a clustering technique in Tableau, making sense of the data through visualization. In the third phase, the results answer questions about descriptive analytics through visualization, helping Flipkart to make better strategic decisions. Finally, we present findings, implications and future research directions for the e-commerce industry using descriptive analytics and data visualization.

R2: What is data visualization and how does it help marketers to take strategic decisions in the e-commerce industry?

The definition of data visualization has already been discussed in the literature review section. We developed a case and proposed a model to bring meaningful insights.
A CASE STUDY ON E-RETAILER IN INDIA: FLIPKART

India is one of the fastest-developing countries in the world, with the second largest population globally (PWC, 2017). It accounts for about 3 percent of global consumption and shows the highest growth among the top 10 countries ranked by size of Household Final Consumption Expenditure (HFCE) (PWC, 2017). Due to the fast growth of technology, the retail sector has expanded into e-retail in the current digital era, and e-commerce is transforming business. The Indian e-commerce market is predicted to grow to US$200 billion by 2026, with constant progress in the industry triggered by the rise of the internet and massive smartphone usage (IBEF, 2020). In August 2020, the number of internet users increased significantly to approximately 760 million under the 'Digital India' program (IBEF, 2020). During Covid-19, consumers were spending less, the economy was slowing down and e-retailers were facing uncertain circumstances, although market researchers predicted that sales would pick up after 2021. Due to the pandemic, many operations shifted to e-commerce, including groceries, pharmacy and health consultations, creating one of the greatest opportunities for e-commerce to engage customers and build trust (IBEF, 2020). There was also a great business opportunity for
global players; for example, Facebook is investing in Reliance Jio in the e-commerce market (ET, 2020). Indian policy also permits 51 percent FDI in multi-brand e-retail (IBEF, 2020). In our research, we selected a case on Flipkart, an e-retailer of Indian origin founded in 2007 by Sachin Bansal and Binny Bansal. It is one of India's leading e-commerce companies and is headquartered in Bengaluru. The company started with an online book store before expanding into selling mobile phones. Currently, the company offers more than 80 million brands categorized into 80 segments (Hire Digital, 2020) and has the capability to deliver to 8 million customers per month (Hire Digital, 2020; IBEF, 2020). It raised an additional US$1.2 billion from Walmart, its major investor, in 2020 (IBEF, 2020), for a total post-money valuation of around US$24.9 billion. The company has committed to transitioning to electric vehicles across its e-retailing operations by 2030, in collaboration with the Climate Group's global electric mobility initiative, EV100, to help achieve the SDGs (Hire Digital, 2020; IBEF, 2020). With the rise of digital transformation in India, Flipkart implemented artificial intelligence-enabled marketing services, creating significant business opportunities as growing numbers of internet users purchase products through smartphones (Hire Digital, 2020). The integration of AI in sales creates sharper service differentiation and a better, more personalized user experience, and manages customer expectations to maintain trust, redefining customer shopping patterns with convenient and affordable prices (Hire Digital, 2020). Further, we propose a conceptual model, shown in Figure 6.1, for data visualization that creates graphs/visual displays serving firms' specific objectives to support data-driven decision making using descriptive analytics.
Figure 6.1  Conceptual model
Research Methodology

Sample size: A survey was conducted using a questionnaire, which received a limited response due to the first wave of Covid-19. We obtained 53 responses from junior and middle-level employees, start-up entrepreneurs and management academicians using a random sampling technique.

Research design: The survey questionnaire was circulated online and divided into two sections. The first section captured the demographic profile of the respondents, and the second section consisted of questions measuring the proposed conceptual model of descriptive analytics in Flipkart on a five-point Likert scale. Further, Tableau software was used for data analysis, applying descriptive analytics and cluster analysis. Clustering refers to the process of classifying data sets into groups depending on their similarity; it is a simple data mining technique that enables class categorization (Marzouk and Enaba, 2019). Questionnaire details related to the Flipkart case are presented in Table 6.1.
Table 6.1 Details of the questionnaire: Flipkart case

Sl No | Labeling data | Description
1 | Demographic data | Age, gender, qualification, city
2 | Social media ad | Social media influence to buy products in Flipkart more often
3 | Flipkart stories | Stories streamed on YouTube, Instagram, Facebook, Twitter
4 | AI-enabled content in advertising | Photos, videos, stories
5 | AI helps to buy the right product in Flipkart | Decision making on purchasing products and pattern
Findings and Analysis

The data visualizations were created in Tableau, a data literacy tool that presents information graphically through charts, graphs and maps, making real-time trends, relationships and data patterns easier to understand (Tableau.com, 2019). In this case, descriptive analytics is communicated through data visualization, which helps the researcher discover associations between the demographic variables and supports strategic decisions related to sales (Batt et al., 2020). Descriptive analytics extracts meaningful insight from unstructured data via medians, averages and trendlines, helping Flipkart to increase sales and to customize its services better (Luo et al., 2018). This analysis can also be delivered through dashboard visualization. Dashboard designers need an understanding of visual perception and should select visual displays that match the data sets, whether the attributes are ordinal or categorical (Susnjak et al., 2022). Each data value can be encoded by a different color, spatial position, or variation in symbol length, size and shape (Susnjak et al., 2022). The user also needs domain competency, combining theoretical grounding with technical capability, when creating dashboards (Klerkx et al., 2017).
In this study, the mean is calculated as a measure of central tendency with respect to the respondents' gender, age and highest qualification, and these values are represented in the dashboard. In addition, the median (with the associated hinges and whiskers) is calculated separately for male and female respondents, as shown in Table 6.2.
Table 6.2 Median value of demographic data

Statistic | Female | Male
Upper whisker | 8 | 29
Upper hinge | 6 | 17
Median | 3.5 | 3.5
Lower hinge | 2 | 1.5
Lower whisker | 1 | 1
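The box-plot statistics reported in Table 6.2 (whiskers, hinges and medians) can be reproduced outside Tableau with a few lines of code. The sketch below, which assumes a small placeholder data frame rather than the study's raw responses, computes a five-number summary of a response-count variable by gender.

```python
import pandas as pd

# Placeholder responses standing in for the survey export (not the study's data)
df = pd.DataFrame({
    "gender": ["Female", "Male", "Female", "Male", "Male", "Female"],
    "response_count": [1, 1, 3, 4, 29, 8],
})

def five_number_summary(series: pd.Series) -> pd.Series:
    """Median, hinges (quartiles) and whiskers, as reported in Table 6.2."""
    q1, median, q3 = series.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    return pd.Series({
        "lower_whisker": series[series >= q1 - 1.5 * iqr].min(),
        "lower_hinge": q1,
        "median": median,
        "upper_hinge": q3,
        "upper_whisker": series[series <= q3 + 1.5 * iqr].max(),
    })

print(df.groupby("gender")["response_count"].apply(five_number_summary))
```

Here the hinges correspond to the quartiles and the whiskers to the most extreme observations within 1.5 interquartile ranges of the hinges, a common box-plot convention.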
The descriptive analysis of how AI influences customer satisfaction and shortens product decision-making time uses the standard deviation and is represented through data visualization alongside the numerical values, as shown in Figures 6.2 and 6.3.
Note: Moving Average of Count of Form Responses 1 for each Customer Satisfaction broken down by AI influence. Shades of gray show details about Customer Satisfaction. The marks are labeled by Customer Satisfaction and count of Form Responses 1.
Figure 6.2
AI influence and customer satisfaction in Flipkart
Note: Count of Form Responses 1 for each Less time. Shades of gray show details about maximum of Less time. The marks are labeled by count of Form Responses 1.
Figure 6.3
Median value: less time to make a decision on product buying in Flipkart
Note: Shades of gray show details about Flipkart Stories. Size shows count of Form Responses 1. The marks are labeled by Flipkart Stories.
Figure 6.4
Cluster of Flipkart stories
Table 6.3 Inputs for cluster analysis

Variables: Count of Form Responses
Level of detail: Flipkart Stories
Scaling: Normalized

Table 6.4 Cluster analysis

Number of clusters: 4
Number of points: 4
Between-group sum of squares: 0.58105
Within-group sum of squares: 0.0
Total sum of squares: 0.58105

Clusters | Number of items | Centre (Count of Form Responses)
Cluster 1 | 1 | 19.0
Cluster 2 | 1 | 15.0
Cluster 3 | 1 | 16.0
Cluster 4 | 1 | 3.0
Table 6.5 Cluster model: analysis of variance

Variable | F-statistic | p-value | Model sum of squares | Model DF | Error sum of squares | Error DF
Count of Form Responses | 0 | 0.01 | 0.5811 | 3 | 0.5811 | 0
Clustering

We applied cluster analysis to Flipkart stories streamed on various social media websites (see Figure 6.4) to understand the market segments better, and to discover how prospective customers engage with Flipkart when making new-product, positioning and pricing decisions on digital platforms. The analysis also helps us understand the Indian market and the data patterns, voice and image content available in Flipkart stories. The inputs and results of the cluster analysis are given in Tables 6.3 to 6.5; an equivalent clustering step is sketched in code below. In the cluster model in Table 6.5, the unsupervised machine learning algorithm segregates the data based on the chosen metric and presents the result through data visualization. The calculated p-value of the cluster pattern is 0.01, which indicates significance, suggesting that the model can support efforts to increase sales through cluster analysis. Summarizing the descriptive analytics results for the Flipkart case: the average of social media influence on customers' purchases is 10.6, and the median value of 13.0 signifies that customers buy products based on customer reviews, ratings, product descriptions, social media posts, and so on. Furthermore, customer buying frequency increased with customization, as reflected in the standard deviation (SD) of 18.39. Descriptive analytics thus benefits Flipkart through increased sales, retention of existing customers, acquisition of new customers, better and more customized services, improved organizational performance, and so on. In addition, constant change in global markets driven by customer demand is challenging for the e-commerce industry (Kumar and Reinartz, 2012). To sustain its competitive position, Flipkart gathered more customer data to deliver novel promotions, offers and discounts, and good service to increase sales (Kumar and Reinartz, 2012).
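Tableau's built-in clustering is based on k-means. As a rough equivalent outside Tableau, the sketch below clusters normalized story-level response counts with scikit-learn. The four counts are taken from the cluster centres in Table 6.4, but the channel labels and the framing of one observation per channel are illustrative assumptions rather than the study's raw data.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Response counts per Flipkart Stories channel (values taken from the cluster
# centres in Table 6.4; the channel labels are illustrative)
channels = ["YouTube", "Instagram", "Facebook", "Twitter"]
counts = np.array([[19.0], [15.0], [16.0], [3.0]])

# Table 6.3 specifies normalized scaling; a min-max scaler stands in for that step
scaled = MinMaxScaler().fit_transform(counts)

# k-means with four clusters, as in Table 6.4
model = KMeans(n_clusters=4, n_init=10, random_state=42).fit(scaled)

for channel, label in zip(channels, model.labels_):
    print(f"{channel}: cluster {label}")
print("Within-group sum of squares:", round(model.inertia_, 5))
```

With four observations and four clusters, each point forms its own cluster and the within-group sum of squares is zero, which mirrors the diagnostics reported in Table 6.4.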
IMPLICATIONS, CHALLENGES AND FUTURE RESEARCH Research Implications Using descriptive analytics, the study discovered that there was a significant impact on the p-value of 0.01 by using cluster analysis. Further, this p-value is a modest value that is less than 0.05 (Taylor, 1990; Behera et al., 2021). Through descriptive analytics, by creating the dashboards, it was found that social media engages more customers to make buying decisions based on reviews, ratings and product description. With the integration of automation, Flipkart can provide more customization and can reach a maximum number of customers. This signifies that descriptive analytics through data visualization in Flipkart achieves a competitive advantage. Furthermore, by data analysis, we found that the customer bought more products in Flipkart and also time was saved by using median analysis (Pearl, 2014). With the help of AI, a couple of changes happened in Flipkart by increasing the number of new customers/ visitors able to purchase products more frequently. The results of our analysis reveal that embracing analytics in the sample confirmed that due to the automation-enabled process, new users and sales in Flipkart increased by an average of 10–15 percent. Descriptive analytics using multiple methods makes a positive impact on the outcome. Also, the evidence and interpretation of dashboards enabled Flipkart to make strategic decisions and eventually increase female customers’ purchases. This research validates previous research (Behera et al., 2021; Yerpude and Singhal, 2019; Sedkaoui, 2018) and our results are more consistent, helping Flipkart to increase their revenue, enhance customer satisfaction, and improve new product development to bring more agility to their data-driven decisions. The research contributions emphasized the importance of data visualization by creating dashboards that help to address customer queries, improve customer experience and willingness to purchase the products more often. Therefore, we believe that our novel findings using limited data sets and the deployment of descriptive analytics using dashboards in e-commerce create business value in the model presented in Figure 6.1, which applies to any descriptive solutions. The implications for management are that they must be aware of the cluster models to discover the different categorizations in order to enter a new segment in the e-commerce industry. In addition, industry experts and managers should understand that descriptive analytics provides a novel approach for the transition of product-centric to customer-centric operations by using automation as an emerging trend in the e-commerce industry. Today, the market is becoming more complex due to high technological advancements, and descriptive analytics is an efficient way to highlight the data, while managers must possess data insight for strategic decision-making. In the fast-growing marketplace, managers and retailers need to understand the proactive way to leverage and understand the data from various sources to achieve actionable insights such as sales forecasting, pricing, buying decisions, and so on. Data-driven strategic decisions are one of the most important crucial capabilities that managers can learn when leading their firms in the dynamic business environment. The greatest challenge for firms is the lack of data skills, expertise, investments, tools and technology, which are still at an early stage. 
Also, junior and middle-level employees have to be trained in data-driven initiatives within diversified cultures. The biggest hurdle is how to collect all the scattered data into a single platform where they are able to understand customers’ requirements to bring both customer equity and brand equity.
Lastly, future research can extend this work to both predictive and prescriptive analytics in the e-commerce industry, addressing the various financial and operational risks and challenges of competing in the global market. Predictive analytics for e-commerce also needs to be studied using a segmentation approach to enhance strategic e-retailing decisions.
REFERENCES Abdullah, A.S., Selvakumar, S. and Abirami, A.M. (2017). An introduction to data analytics: Its types and its applications. In S.K. Trivedi, S. Dey, A. Kumar and T.K. Panda (eds), Handbook of Research on Advanced Data Mining Techniques and Applications for Business Intelligence. Hershey, PA: IGI Global, pp. 1–14. Akter, S. and Wamba, S.F. (2016). Big data analytics in E-commerce: A systematic review and agenda for future research. Electronic Markets, 26(2), 173–94. Ali, N. and Peebles, D. (2013). The effect of Gestalt laws of the perceptual organization on the comprehension of three-variable bar and line graphs. Human Factors, 55(1), 183–203. Allen, E.A., Erhardt, E.B. and Calhoun, V.D. (2012). Data visualization in the neurosciences: Overcoming the curse of dimensionality. Neuron, 74(4), 603–608. Allwood, D., Hildon, Z. and Black, N. (2013). Clinicians’ views of formats of performance comparisons. Journal of Evaluation in Clinical Practice, 19(1), 86–93. Anderson, J.C. and Muller, J.M. (2011). The effects of experience and data presentation format on an auditing judgment. Journal of Applied Business Research, 21(1), 53–63. Appelbaum, D., Kogan, A., Vasarhelyi, M. and Yan, Z. (2017). Impact of business analytics and enterprise systems on managerial accounting. International Journal of Accounting Information Systems, 25(March), 29–44. Ashman, R. and Patterson, A. (2015). Seeing the big picture in services marketing research: Infographics, SEM and data visualisation. Journal of Services Marketing, 29(6–7), 613–21. Aswani, R., Kar, A.K. and Vigneswara Ilavarasan, P. (2018a). Detection of spammers in Twitter marketing: A hybrid approach using social media analytics and bio inspired computing. Information Systems Frontiers, 20(3), 515–30. Aswani, R., Kar, A.K., Ilavarasan, P.V. and Dwivedi, Y.K. (2018b). Search engine marketing is not all gold: Insights from Twitter and SEOClerks. International Journal of Information Management, 38(1), 107–16. Balducci, B. and Marinova, D. (2018). Unstructured data in marketing. Journal of the Academy of Marketing Science, 46(4), 557–90. Banerjee, A., Bandyopadhyay, T. and Acharya, P. (2013). Data analytics: Hyped up aspirations or true potential? Vikalpa, 38(4), 1–12. Batt, S., Grealis, T., Harmon, O. and Tomolonis, P. (2020). Learning Tableau: A data visualization tool. The Journal of Economic Education, 51(3–4), 317–28. Bedeley, R.T., Ghoshal, T., Iyer, L.S. and Bhadury, J. (2018). Business analytics and organizational value chains: A relational mapping. Journal of Computer Information Systems, 58(2), 151–61. Behera, R.K., Bala, P.K., Tata, S.V. and Rana, N.P. (2021). Retail atmospherics effect on store performance and personalised shopper behaviour: A cognitive computing approach. International Journal of Emerging Markets. Benbasat, I. and Dexter, A. (1986). An investigation of the effectiveness of color and graphical information presentation under varying time constraints. MIS Quarterly, 10(1), 59–83. Bergmann, M., Brück, C., Knauer, T. and Schwering, A. (2020). Digitization of the budgeting process: Determinants of the use of business analytics and its effect on satisfaction with the budgeting process. Journal of Management Control, 31(1), 25–54. Berman, R. and Israeli, A. (2021). The added value of data-analytics: Evidence from online retailers. Working paper. Borkin, M.A., Bylinskii, Z., Kim, N.W., Bainbridge, C.M., Yeh, C.S., Borkin, D., Pfister, H. et al. (2015). Beyond memorability: Visualization recognition and recall. 
IEEE Transactions on Visualization and Computer Graphics, 22(1), 519–28.
Descriptive analytics and data visualization in e-commerce 101 Brynjolfsson, E. and McElheran, K. (2016). Data in action: Data-driven decision making in US manufacturing. US Census Bureau Center for Economic Studies Paper No. CES-WP-16-06, Rotman School of Management Working Paper, 2722502. Carpenter, P.A. and Shah, P. (1998). A model of the perceptual and conceptual processes in graph comprehension. Journal of Experimental Psychology: Applied, 4(2), 75. Caughlin, D.E. and Bauer, T.N. (2019). Data visualizations and human resource management: The state of science and practice. Research in Personnel and Human Resources Management, 37, 89–32. Chae, B.K., Yang, C., Olson, D. and Sheu, C. (2014). The impact of advanced analytics and data accuracy on operational performance: A contingent resource based theory (RBT) perspective. Decision Support Systems, 59, 119–12. Chang, Y. (2019). Spectators’ emotional responses in tweets during the Super Bowl 50 game. Sport Management Review, 22(3), 348–62. Chaudhuri, S., Dayal, U. and Narasayya, V. (2011). An overview of business intelligence technology. Communications of the ACM, 54(8), 88–98. Chen, H., Chiang, R.H. and Storey, V.C. (2012). Business intelligence and analytics: From big data to big impact. MIS Quarterly, 36(4), 1165–88. Cleveland, W.S., Diaconis, P. and McGill, R. (1982). Variables on scatterplots look more highly correlated when the scales are increased. Science, 216(4550), 1138–41. Croxton, F.E. and Stein, H. (1932). Graphic comparisons by bars, squares, circles, and cubes. Journal of the American Statistical Association, 27(177), 54–60. Davenport, T.H. and Dyché, J. (2013). Big data in big companies. International Institute for Analytics, 3(1–31). Davenport, T.H. and Harris, J.G. (2007). The architecture of business intelligence. In Competing on Analytics: The New Science of Winning. Boston, MA: Harvard Business School Press. Davenport, T.H., De Long, D.W. and Beers, M.C. (1998). Successful knowledge management projects. MIT Sloan Management Review, 39(2), 43. Delen, D. and Ram, S. (2018). Research challenges and opportunities in business analytics. Journal of Business Analytics, 1(1), 2–12. Dilla, W., Janvrin, D.J. and Raschke, R. (2010). Interactive data visualization: New directions for accounting information systems research. Journal of Information Systems, 24(2), 1–37. Dill, J., Earnshaw, R., Kasik, D., Vince, J. and Wong, P.C. (eds) (2012). Expanding the Frontiers of Visual Analytics and Visualization. Berlin and Heidelberg: Springer Science & Business Media. Eick, S.G. and Wills, G.J. (1995). High interaction graphics. European Journal of Operational Research, 81(3), 445–59. Eilam, B. and Poyas, Y. (2010). External visual representations in science learning: The case of relations among system components. International Journal of Science Education, 32(17), 2335–66. ET (2020). CCI okays Facebook’s investment of Rs 43,574 crore in Jio Platforms. The Economic Times. Retrieved 12 December 2021 at https://economictimes.indiatimes.com/tech/internet/cci-okays -facebooks-investment-in-jio-platforms/articleshow/76561345.cms. Finholm, C.E. and Shreiner, T.L. (2022). A lesson in missed opportunities: Examining the use of data visualizations in online history lessons. Social Studies Research and Practice. Fleming, O., Fountaine, T., Henke, N. and Saleh, T. (2018). Ten red flags signaling your analytics program will fail. McKinsey Quarterly, 14. Ghofrani, F., He, Q., Goverde, R.M. and Liu, X. (2018). 
Recent applications of big data analytics in railway transportation systems: A survey. Transportation Research Part C: Emerging Technologies, 90, 226–46. Ghoshal, T., Bedeley, R.T., Iyer, L.S. and Bhadury, J. (2018). Business analytics capabilities and use: A value chain perspective. In A.V. Deokar et al. (eds), Analytics and Data Science. Cham: Springer, pp. 41–54. Gifford, T. (2013). Integrated analytics in transportation and logistics. In Proceedings of INFORMS Conference on Business Analytics & Operations Research, pp. 1–39. Giri, C., Thomassey, S. and Zeng, X. (2019). Customer analytics in fashion retail industry. In A. Majumdar, D. Gupta and S. Gupta (eds), Functional Textiles and Clothing. Singapore, Springer, pp. 349–61.
102 Handbook of big data research methods Goran, J., LaBerge, L. and Srinivasan, R. (2017). Culture for a digital age. McKinsey Quarterly, 3(1), 56–67. Gorman, M.F. and Klimberg, R.K. (2014). Benchmarking academic programs in business analytics. Interfaces, 44(3), 329–41. Grandhi, B., Patwa, N. and Saleem, K. (2020). Data-driven marketing for growth and profitability. EuroMed Journal of Business, 16(4), 381–98. Hayashi, Y., Hsieh, M.H. and Setiono, R. (2010). Understanding consumer heterogeneity: A business intelligence application of neural networks. Knowledge-Based Systems, 23(8), 856–63. He, W., Tian, X., Chen, Y. and Chong, D. (2016). Actionable social media competitive analytics for understanding customer experiences. Journal of Computer Information Systems, 56(2), 145–55. Hindle, G.A. and Vidgen, R. (2018). Developing a business analytics methodology: A case study in the foodbank sector. European Journal of Operational Research, 268(3), 836–51. Hire Digital (2020). Hire Digital Insights. Retrieved 12 December 2021 at https://hiredigital.com/blog/ how-flipkart-is-using-artificial-intelligence. Holsapple, C., Lee-Post, A. and Pakath, R. (2014). A unified foundation for business analytics. Decision Support Systems, 64, 130–41. Hoyt, R.E., Snider, D.H., Thompson, C.J. and Mantravadi, S. (2016). IBM Watson analytics: Automating visualization, descriptive, and predictive statistics. JMIR Public Health and Surveillance, 2(2), e5810. IBEF (2020). Flipkart Internet Pvt Ltd. Retrieved 12 December 2021 at https://www.ibef.org/industry/ ecommerce/showcase/flipkart-internet-pvt-ltd. Jalali, S.M.J. and Park, H.W. (2018). State of the art in business analytics: Themes and collaborations. Quality & Quantity, 52(2), 627–33. Johnson, D.S., Muzellec, L., Sihi, D. and Zahay, D. (2019). The marketing organization’s journey to become data-driven. Journal of Research in Interactive Marketing, 13(2), 162–78. Kelly, J.D. (1993). The effects of display format and data density on time spent reading statistics in text, tables and graphs. Journalism Quarterly, 70(1), 140–49. Kim, Y.H. and Ahn, J.H. (2016). A study on the application of big data to the Korean college education system. Procedia Computer Science, 91, 855–61. Klerkx, J., Verbert, K. and Duval, E. (2017). Learning analytics dashboards. In Handbook of Learning Analytics. Society for Learning Analytics Research, pp. 143–50. Kotler, P. and Keller, K.L. (2012). Prentice Hall Video Library to Accompany Marketing Management. Pearson/Prentice Hall. Kumar, U.D. (2017). Business Analytics: The Science of Data-driven Decision Making. Wiley. Kumar, V. and Reinartz, W. (2012). Customer relationship management issues in the business-to-business context. In Customer Relationship Management. Springer, pp. 261–77. Lewandowsky, S. and Spence, I. (1989–1990). The perception of statistical graphs. Sociological Methods & Research, 18(2–3), 200–242. Luo, Z., Hsieh, J.T., Balachandar, N., Yeung, S., Pusiol, G., Luxenberg, J., Li, G. et al. (2018). Computer vision-based descriptive analytics of seniors’ daily activities for long-term health monitoring. Machine Learning for Healthcare Conference (MLHC), 2, 1. Mamo, Y., Su, Y. and Andrew, D.P. (2021). The transformative impact of big data applications in sport marketing: Current and future directions. International Journal of Sports Marketing and Sponsorship. Mariani, M., Bresciani, S. and Dagnino, G.B. (2021). 
The competitive productivity (CP) of tourism destinations: An integrative conceptual framework and a reflection on big data and analytics. International Journal of Contemporary Hospitality Management, 33(9), 2970–3002. Martínez-Martínez, A., Navarro, J.G.C., García-Pérez, A. and Moreno-Ponce, A. (2019). Environmental knowledge strategy: Driving success of the hospitality industry. Management Research Review, 42(6), 662–80. Martínez-Martínez, A., Suárez, L.M.C., Montero, R.S. and del Arco, E.A. (2018). Knowledge management as a tool for improving business processes: An action research approach. Journal of Industrial Engineering and Management, 11(2), 276–89. Marzouk, M. and Enaba, M. (2019). Analyzing project data in BIM with descriptive analytics to improve project performance. Built Environment Project and Asset Management, 9 (4), 476–88. Mathur, P. (2019). Overview of machine learning in retail. In Machine Learning Applications Using Python (pp. 147–57). Berkeley, CA: Apress.
Descriptive analytics and data visualization in e-commerce 103 McCarthy, R.V., McCarthy, M.M. and Ceccucci, W. (2022). Introduction to predictive analytics. In Applying Predictive Analytics (pp. 1–26). Cham: Springer. MacKay, D.B. and Villarreal, A. (1987). Performance differences in the use of graphic and tabular displays of multivariate data. Decision Sciences, 18(4), 535–46. Mintz, O., Bart, Y., Lenk, P. and Reibstein, D. (2019). Drowning in metrics: How managers select and trade-off metrics for making marketing budgetary decisions. Accessed at https://opus.lib.uts.edu.au/ bitstream/10453/138409/2/MetricTradeoffs%20MSI%20Official.pdf. Morgan, B. (2019), Descriptive analytics, prescriptive analytics and predictive analytics for customer experience, Forbes, 21 February. Retrieved 10 March 2022 at https://www.forbes.com/sites/ blakemorgan/2019/02/21/descriptive-analytics-prescriptive-analytics-and-predictive-analytics-for -customer-experience/?sh=5ef3de9969e0. Mortenson, M.J., Doherty, N.F. and Robinson, S. (2015). Operational research from Taylorism to Terabytes: A research agenda for the analytics age. European Journal of Operational Research, 241(3), 583–95. Müller, O., Fay, M. and Vom Brocke, J. (2018). The effect of big data and analytics on firm performance: An econometric analysis considering industry characteristics. Journal of Management Information Systems, 35(2), 488–509. Oswald, F.L., Behrend, T.S., Putka, D.J. and Sinar, E. (2020). Big data in industrial-organizational psychology and human resource management: Forward progress for organizational research and practice. Annual Review of Organizational Psychology and Organizational Behavior, 7, 505–33. Pearl, J. (2014). Interpretation and identification of causal mediation. Psychological Methods, 19(4), 459. PricewaterhouseCoopers (PWC) (2017). The promise of Indian retail: From vision to execution. Retrieved 12 December 2021 at https://www.pwc.in/assets/pdfs/publications/2017/the-promise-of -indian-retail-from-vision-to-execution.pdf. Raghu, T.S. and Vinze, A. (2007). A business process context for knowledge management. Decision Support Systems, 43(3), 1062–1079. Raghupathi, W. and Raghupathi, V. (2013). An overview of health analytics. Journal of Health & Medical Informatics, 4(132), 2. Raghupathi, W. and Raghupathi, V. (2018). An empirical study of chronic diseases in the United States: A visual analytics approach to public health. International Journal of Environmental Research and Public Health, 15(3), 431. Raghupathi, W. and Raghupathi, V. (2021). Contemporary business analytics: An overview. Data, 6(8), 86. Rawat (2021), An overview of descriptive analysis Retrieved 10 March 2022 at https://www.analyticssteps .com/blogs/overview-descriptive-analysis. Rehman, M.H., Chang, V., Batool, A. and Wah, T.Y. (2016). Big data reduction framework for value creation in sustainable enterprises. International Journal of Information Management, 36(6), 917–28. Ridge, M., Johnston, K.A. and O’Donovan, B. (2015). The use of big data analytics in the retail industries in South Africa. African Journal of Business Management, 9(19), 688–703. Sahay, A. (2018). Business Analytics, Volume I: A Data-Driven Decision Making Approach for Business. Business Expert Press. Sahu, R., Dash, M. and Kumar, A. (eds) (2017). Applying Predictive Analytics Within the Service Sector. IGI Global. Sathiyanarayanan, M. and Turkay, C. (2017). 
Challenges and opportunities in using analytics combined with visualisation techniques for finding anomalies in digital communications. Paper presented at the 16th International Conference on Artificial Intelligence and Law, 12–16 June 2017, London. Accessed at https://openaccess.city.ac.uk/id/eprint/22830/. Sedkaoui, S. (2018). How data analytics is changing entrepreneurial opportunities? International Journal of Innovation Science, 10(2), 274–94. Shah, P. and Freedman, E.G. (2011). Bar and line graph comprehension: An interaction of top‐down and bottom‐up processes. Topics in Cognitive Science, 3(3), 560–78. Shanks, G. and Bekmamedova, N. (2012). Achieving benefits with business analytics systems: An evolutionary process perspective. Journal of Decision Systems, 21(3), 231–44. Sharma, A.M. (2020). Data visualization. In Data Science and Analytics, 1–22.
104 Handbook of big data research methods Sharma, R., Mithas, S. and Kankanhalli, A. (2014). Transforming decision-making processes: A research agenda for understanding the impact of business analytics on organisations. European Journal of Information Systems, 23(4), 433–41. Shen, K.Y. and Tzeng, G.H. (2016). Contextual improvement planning by fuzzy-rough machine learning: A novel bipolar approach for business analytics. International Journal of Fuzzy Systems, 18(6), 940–55. Sinar, E.F. (2015). Data visualization. In S. Tonidandel, E.B. King and J.M. Cortina (eds), Big Data at Work: The Data Science Revolution and Organizational Psychology. Abingdon: Routledge, pp. 115–57. Sinar, E.F. (2018). Data visualization: Get visual to drive HR’s impact and influence. Society for Human Resource Management (SHRM) Society for Industrial Organizational Psychology (SIOP) Science of HR White Paper Series. Susnjak, T., Ramaswami, G.S. and Mathrani, A. (2022). Learning analytics dashboard: A tool for providing actionable insights to learners. International Journal of Educational Technology in Higher Education, 19(1), 1–23. Tableau.com (2019). Data visualization beginner’s guide: A definition, examples, and learning resources. Retrieved 10 December 2021 at https:// www.tableau.com/learn/articles/data-visualization. Taylor, R. (1990). Interpretation of the correlation coefficient: A basic review. Journal of Diagnostic Medical Sonography, 6(1), 35–9. Vidgen, R., Shaw, S. and Grant, D.B. (2017). Management challenges in creating value from business analytics. European Journal of Operational Research, 261(2), 626–39. Vila, J. and Gomez, Y. (2016). Extracting business information from graphs: An eye tracking experiment. Journal of Business Research, 69(5), 1741–46. Wamba, S.F., Akter, S., Edwards, A., Chopin, G. and Gnanzou, D. (2015). How ‘big data’ can make big impact: Findings from a systematic review and a longitudinal case study. International Journal of Production Economics, 165, 234–46. Wang, G., Gunasekaran, A., Ngai, E.W. and Papadopoulos, T. (2016). Big data analytics in logistics and supply chain management: Certain investigations for research and applications. International Journal of Production Economics, 176, 98–110. Watson, H.J. (2009). Tutorial: Business intelligence – past, present, and future. Communications of the Association for Information Systems, 25(1), 39. Yerpude, S. and Singhal, T.K. (2019). ‘Custolytics’: Internet of Things based customer analytics aiding customer engagement strategy in emerging markets – an empirical research. International Journal of Emerging Markets, 16(1), 92–112. Yin, J. and Fernandez, V. (2020). A systematic review on business analytics. Journal of Industrial Engineering and Management, 13(2), 283–95. Zeng, D., Tim, Y., Yu, J. and Liu, W. (2020). Actualizing big data analytics for smart cities: A cascading affordance study. International Journal of Information Management, 54, 102156. Zhang, Y., Luo, H. and He, Y. (2015). A system for tender price evaluation of construction project based on big data. Procedia Engineering, 123, 606–14.
7. Application of big data Bayesian interrupted time-series modeling for intervention analysis Neha Chaudhuri and Kevin Carillo
1. INTRODUCTION Social processes are inherently time-bound, and no theory can be truly time-independent (Zaheer et al., 1999). It is extremely rare to find isolated data points to explain social phenomena, leading to unreliable findings from classical statistical methods. For instance, the common assumption of IID (independent and identically distributed) random variables is often unrealistic while trying to examine stock prices for two companies in the same sector (Lo and MacKinlay, 1988). This is because these companies would most likely be competitors, and therefore it is highly likely that their stock prices would not be independent, rather they would be temporally interdependent on each other. Consequently, aided by recent technological advancements in data generation and collection methods, social research has been increasingly integrating temporal dynamics into its theories and theoretical models. Data in the domains of business (Chaudhuri et al., 2021), economics, public policies (Feroze, 2020), healthcare (Ginsberg et al., 2009) and other areas of socio-scientific investigations are now often extracted in the form of time series, that is, as a sequence of observations. This, for instance, includes daily stock prices, frequent customer visits to retail stores (Chaudhuri et al., 2021), or annual hospital visits (Piroozi et al., 2018) to name but a few. The knowledge that can be obtained from the dynamic characteristics of datasets can help to produce accurate and more reliable forecasts of future observations and to design optimal frameworks that detect and explain the impacts of unforeseeable interventions as well (Fu, 2011). Such recent shifts in data choices and data collection designs necessitate renewed efforts for the development and application of novel time series modeling designs capable of handling dynamic and complex temporal datasets. However, while such time-ordered sequences of observations provide opportunities to draw richer insights, time series methodologies in social research are still scarce. The conventional application domains of time series analysis (such as economics or finance) have primarily focused on forecasting, as a result of which time series literature is dominated by models that are aimed at prediction, but not explanation (Jebb et al., 2015; Shmueli, 2010). For instance, the widely used conventional time series analysis methods such as ARIMA (AutoRegressive Integrated Moving Average) models produce effective results for shorter periods but have been found to be unreliable for longer periods of time as they lack the capacity to detect or explain unforeseeable social events that could result in deviations of actual temporal patterns from predicted trends (Davies et al., 1995; Zhao et al., 2019). Therefore, to address these growing methodological needs, this chapter proposes the use of time series modeling to examine the impacts of social interventions over time. Specifically, we advocate the application of Bayesian Interrupted Time Series (BITS) design as a quasi-experimental methodology to detect and assess the impact of interventions without 105
dismissing contextual and temporal factors from the analysis. To demonstrate the application of BITS modeling for intervention analysis, we use the case study of the Life-IP Clean Air Policy implemented in a Bulgarian city by the regulatory body to improve air quality in the city and its surrounding areas.
2. BACKGROUND

2.1 Intervention Analysis
Intervention analysis research assesses the impacts of social interventions and events (Sharma and Jain, 2020). It has a wide range of application domains such as the assessment of the impact of political endeavors or economic policies (Perles-Ribes et al., 2019), and more recently of the COVID-19 pandemic (Ågerfalk et al., 2020; Feroze, 2020). Researchers have primarily adopted hypothesis testing using paired t-tests for intervention analyses (Bhattacharyya et al., 2020), or, action research and design science approaches (Baskerville and Wood-Harper, 1996). Another popular approach for intervention analysis is the difference-in-difference analysis as a quasi-experimental approach that compares the impact of interventions between a treatment group (a group that has received an intervention) and a control group (a group that has not received it) (Ho et al., 2018). However, like econometric methods, this approach assumes that the treatment and control groups would have parallel outcome trends (that is, it assumes that the difference between groups would be constant over time), and that there would be no spillover effects (Petimar et al., 2019). These assumptions do not hold true for most of the real-world scenarios where there are high chances of changing relationships between the groups over time, or control group candidates could inadvertently and indirectly be affected by the social interventions applied in the treatment group (Castañeda et al., 2018). Therefore, in this chapter, we argue that intervention analysis research as a methodological approach can contribute to the growing practice-oriented research that aims to reliably detect and explain the impact of social interventions while also providing an alternative method for temporal evolution assessment (Siponen and Baskerville, 2018). More specifically, Bayesian structural time-series analysis (explained in subsequent sections) is a promising data science technique for the difference-in-difference approach to effectively evaluate the effect of interventions. 2.2
Interrupted Time Series for Intervention Analysis
Randomized control trials are the most widely used approach to examine the impact of social interventions, such as public policies for healthcare (Buvik et al., 2019; Lewin et al., 2009) for instance. However, these methods suffer from several limitations including substantial cost of research to cover the requirements for a sufficient number of societal communities for such examinations, lack of generalizable findings about societal influence processes over time, inability to consider the complex interrelationships among factors that directly influence society (e.g., direct interventions) and indirect factors (e.g., temporal factors) (Edwards et al., 1999). As a result of these challenges, social research in this domain has progressed rather slowly. This often compels policymakers and regulatory bodies to implement suboptimal untested intervention strategies and policies out of their aspirations to find immediate solutions that
Bayesian interrupted time-series modeling for intervention analysis 107 address the targeted social concerns. This has necessitated further research to find more efficient, reliable, and robust methods of evaluating social interventions and policies. As a result, the recent literature has shifted focus to an alternative non-randomized difference-in-difference design that has the potential to distinctly model the counterfactual values observed before and after an intervention (Biglan et al., 2000; Linden, 2015; Piroozi et al., 2018; Yang et al., 2021). This approach is termed Interrupted Time Series (ITS). In ITS design, data collection is done at multiple points of time, both pre- and post-intervention (Biglan et al., 2000). Data modeling before the intervention provides an estimate of the underlying trend without any interventions, which, when forecasted for points of time after the intervention, would yield counterfactual results for possible trends in absence of any interventions. The resulting differences between the counterfactual and observed temporal data after a social intervention can then be investigated. Time series data experiments have been widely used in the development and evaluation of interventions in a range of domains such as healthcare and medicine (Yang et al., 2021), education (Musila and Belassi, 2004), marketing (Dekimpe and Hanssens, 2000), and public policy (Chanley et al., 2000). Moreover, the implementation of ITS analysis techniques in the recent literature has provided promising findings on the impact of public policies on societies and governments. Despite its notable success in detecting and explaining intervention impact, ITS is yet to be adopted by social sciences as a relevant method for community interventionists. Current popular methods include segmented regression (Wagner et al., 2002). However, this method suffers from several limitations, including that it limits the intervention to a predetermined time point in the series, and is not efficient in handling autocorrelated and dynamic data points (Wagner et al., 2002). Particularly, these regression models assume that the pre-intervention mean and the population characteristics remain unchanged throughout the analysis period. These assumptions and limitations often lead to unreliable and ungeneralizable results over time for societal impacts. ITS design therefore offers a rigorous methodology to determine the effectiveness of complex social policies and interventions on outcomes in the real world. This baseline model is a widely used methodology for the evaluation of interventions where only the treatment group is present (Wang et al., 2013). This design is a strong alternative to the costly statistical methods that require both treatment and control group data (Linden, 2017). ITS uses counterfactual data as control group and the actual post-intervention data as treatment group for analysis (Wang et al., 2013). Moreover, recent statistical studies have also shown that this single group design has strong internal validity, especially when the entire population is the unit of measure for analysis (Turner et al., 2021). However, a vanilla ITS model can only detect changes in the time series before and after the intervention, failing to account for changes in temporal trends as a result of other factors, independent of the intervention, resulting in biased interpretations about the outcome of a certain social intervention (Linden and Adams, 2011). 
Therefore, it is necessary to use hybrid time series models that can overcome these challenges. As a result, scholars have proposed Bayesian interrupted time series (BITS) as a powerful hybrid tool for causal intervention analysis (Freni-Sterrantino et al., 2019). 2.3
Bayesian Interrupted Time Series Modeling
Bayesian time series modeling was initially developed to examine the impact of a marketing campaign on search-related visits to an advertiser’s website (Freni-Sterrantino et al., 2019).
108 Handbook of big data research methods Since then, BITS has been successfully applied to investigate interventions in the context of transport, waste management and healthcare (Freni-Sterrantino et al., 2019; Morrison et al., 2018; Yang et al., 2021). Originally developed in the eighteenth century, these Bayesian models have witnessed increased attention from social scientists with the advancements of technological capabilities (“Bayesian Statistics and Modelling”, 2021; Davies et al., 1995; Freni-Sterrantino et al., 2019; Kruschke and Liddell, 2018). A BITS model comprises two phases: baseline and treatment. In the baseline phase, data is extracted for several time points, with which the treatment effects are subsequently compared. This baseline phase serves as the control condition (Mendieta-Peñalver et al., 2018). In the treatment phase, the treatment (post-intervention) data is collected and modeled. These two-phase datasets are then fitted independently and compared to investigate the impact and effectiveness of the intervention under scrutiny. Bayesian models are superior to other peer intervention analysis methods such as ARIMA and ESM, as they are capable of handling temporal uncertainties introduced into data models while providing flexibility and modularity by accounting for sources of variation such as seasonality or larger trends (Mourtgos and Adams, 2021). Moreover, ARIMA and ESM approaches are more applicable for forecasting problems rather than impact analysis (Mourtgos and Adams, 2021; Perles-Ribes et al., 2019). Additionally, unlike ARIMA and ESM methods, BITS is capable of handling the temporal evolution of interventions by considering dynamic confidence intervals (Mendieta-Peñalver et al., 2018). Despite its several advantages, a review of the relevant literature revealed a key research gap with respect to the application of Bayesian time series models for intervention analysis. Therefore, this chapter intends to cover this research gap by developing a BITS model that can analyze time series data related to air quality of an European city and estimate the effectiveness of an air quality improvement policy adopted by the city regulatory body over time. This example is used as an illustration of how BITS modeling can provide a novel and effective approach to assess the impact of interventions, measures and policies in social sciences.
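Before turning to the case study, the following minimal sketch illustrates the two-phase logic described above: a model is fitted to the baseline (pre-intervention) phase only, projected forward as the counterfactual, and compared with the observed treatment-phase data. The synthetic series and the simple linear trend are placeholders; the analysis reported later in this chapter uses a Bayesian structural time-series model rather than this naive baseline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic monthly series: 24 baseline months, then a 5-unit drop after the intervention
t = np.arange(36)
y = 50 + 0.2 * t + rng.normal(0, 1, 36)
intervention = 24
y[intervention:] -= 5.0

# Baseline phase: fit a simple trend on the pre-intervention data only
slope, intercept = np.polyfit(t[:intervention], y[:intervention], deg=1)

# Counterfactual: project the baseline trend over the treatment phase
counterfactual = intercept + slope * t[intervention:]
observed = y[intervention:]

# Intervention effect: observed minus counterfactual, pointwise and on average
effect = observed - counterfactual
print("Average effect:   ", round(float(effect.mean()), 2))
print("Cumulative effect:", round(float(effect.sum()), 2))
```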
3. CASE STUDY: LIFE-IP CLEAN AIR POLICY
To demonstrate the application of BITS modeling for intervention analysis, we use the case study of the Life-IP Clean Air Policy implemented in the Bulgarian region of Sofia by the regulatory body to improve air quality in the city and surrounding areas. When it comes to air quality management, Bulgaria is separated into six regions. One of the six regions, the Sofia Municipality, is located in an area with high levels of particulate matter (PM) pollution mostly caused by home heating and transportation. The municipality developed an Air Quality Program, known as Life-IP Clean Air Policy, and joined the Compact of Mayors Global Initiative (CMGI) to reduce greenhouse gas (GHG) emissions. The Sofia Municipality is thereby a member of the largest and first-of-its-kind global alliance of cities committed to climate leadership. This air quality program was then implemented in order to achieve several goals including achieving the EU mandated target of fewer than 35 days per year of above-average daily PM10 levels, reducing maximum PM10 levels measured over 24 hours, and reducing home coal and fire emissions to reduce PM2.5 and Sulphur dioxide (SO2) levels, among others. In what follows, we decided to focus our attention on the objectives set by Life-IP: the SO2 levels for Sofia city were extracted using Satellite Earth Observation data.
4. DATA COLLECTION AND ANALYSIS

4.1 Data Extraction
The air quality data for Sofia were downloaded from the Copernicus Atmosphere Data Store in raw NetCDF4 file format using the CDS API. The air quality datasets are available at a spatial resolution of 10 km and a temporal resolution of 1 hour. Usable data were extracted from the NetCDF4 files using a Python data pipeline, and air quality data (PM10 and SO2) for Sofia were extracted for the July 2018 to July 2021 period. All data extracts are expressed in micrograms/m3. Table 7.1 summarizes the general features of the final dataset.
Table 7.1 Major features of the data collected

Parameter | Value
Spatial resolution | 0.1°
Temporal resolution | 1 hour
Data availability | July 2018 – July 2021
Compounds | PM10, NO2, SO2
Unit | μg/m3
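A minimal sketch of the extraction pipeline described in this subsection is shown below. The cdsapi client call and the xarray selection follow the general Copernicus workflow, but the dataset identifier, request keys and variable names are placeholders that would need to be checked against the Atmosphere Data Store catalogue, and Sofia's coordinates are approximate.

```python
import cdsapi
import xarray as xr

# Step 1: download hourly air quality fields as NetCDF4 from the Copernicus
# Atmosphere Data Store (dataset id and request keys are placeholders)
client = cdsapi.Client()
client.retrieve(
    "cams-europe-air-quality-reanalyses",   # placeholder dataset identifier
    {
        "variable": ["sulphur_dioxide", "particulate_matter_10um"],
        "year": ["2018", "2019", "2020", "2021"],
        "format": "netcdf",
    },
    "sofia_air_quality.nc",
)

# Step 2: select the grid cell closest to Sofia (approx. 42.70 N, 23.32 E)
ds = xr.open_dataset("sofia_air_quality.nc")
sofia = ds.sel(latitude=42.70, longitude=23.32, method="nearest")

# Step 3: flatten to an hourly time series (values in micrograms per cubic metre)
so2 = sofia["so2"].to_series()   # variable name depends on the product
so2.to_csv("sofia_so2_hourly.csv")
```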
4.2 Data Analysis
To assess the impact of the Sofia air quality intervention, we used a Bayesian Interrupted Time Series (BITS) design that distinctly models the counterfactual values observed before and after the intervention (pre- versus post-intervention). This approach differs from more conventional methods in that it uses Bayesian time series to generate estimates and relies on model averaging to construct the best synthetic control for estimating the counterfactual values. The values associated with the causal impact correspond to the differences between the observed values after the intervention and the values that would have been obtained under normal circumstances. The model takes three forms of information as input: the time series data of the target variable before the intervention, the time series data of other comparable variables, and prior knowledge about the model parameters. The variable of interest for this analysis was the temporal air quality level, which had very high autocorrelation and potentially suffered from omitted variable bias (that is, the possible presence of unknown or unobserved variables that affected air quality levels). Additionally, since the chosen case study is a policy examination initiated by a regulatory body, there was no control group for the analysis, justifying the adoption of a non-randomized difference-in-difference design. We further used a change detection algorithm to determine whether there was indeed a change in the trends of the temporal variable. This change detection algorithm considers a sequence of independent random variables along with its probability density and checks whether there is a significant change in trend based on any of the scalar parameters (Bernal et al., 2017). When applying change detection in our analysis, it is important to decide on the impact model for the change. Possible kinds of impact models include: (a) level change, (b) slope change, (c) level change and slope change, (d) slope change following a lag, (e) temporary level change, and (f) temporary slope change leading to a level change (Bernal et al., 2017). These different impact models are shown in Figure 7.1.
Source: Bernal et al. (2017).
Figure 7.1
Level change, slope change, level change and slope change, slope change following a lag, temporary level change, temporary slope change leading to a level change
In addition to change detection, it is also necessary to detect the change point in the trend in order to detect the presence of an intervention and its impact. For this, we used CUMSUM, which stands for Cumulative Sum. This approach uses a log-likelihood ratio to identify changes and to determine the change point by iteratively estimating the means before and after the change point and by maximizing/minimizing the CUMSUM value depending on whether it is an increasing or decreasing trend respectively until the change point converges. Figure 7.2 summarizes the methodology followed in this chapter.
Figure 7.2
Research methodology
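To make the CUMSUM idea concrete, the sketch below implements it in its simplest mean-shift form: cumulative sums of deviations from the overall mean drift away from zero once the level changes, and the extremum of the cumulative-sum curve is taken as the change point. This is a simplified illustration on synthetic data, not the exact detector used in the study.

```python
import numpy as np

def cusum_change_point(series: np.ndarray) -> int:
    """Index where the cumulative sum of deviations from the overall mean
    is furthest from zero (a simple mean-shift change-point estimate)."""
    cumulative = np.cumsum(series - series.mean())
    return int(np.argmax(np.abs(cumulative)))

# Synthetic SO2-like series with a downward level shift part-way through
rng = np.random.default_rng(1)
series = np.concatenate([
    rng.normal(13.7, 0.5, 300),  # pre-intervention level
    rng.normal(8.7, 0.5, 300),   # post-intervention level
])

print("Detected change point index:", cusum_change_point(series))  # close to 300
```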
4.2.1 Software tools
We used the R package "CausalImpact", developed by the Google Research team, for the BITS model (Brodersen et al., 2015). The package was initially developed to assess the impact of market interventions, but it can equally be applied to similar scenarios in epidemiology, health care, social science, and so on. Since this method does not rely on the presence of a control group, the pre-intervention temporal data need to be sufficiently long so that the model can adequately learn the temporal patterns and create the counterfactual trend that serves as the control dataset. Our model finally took as inputs the target (the actual post-intervention temporal data) and the control (counterfactual time series data for the same period as the treatment group, that is, the learned trend that would have occurred without the intervention). The output of the model included the statistical significance of the intervention along with a set of three plots for inspecting the model results. A minimal usage sketch is given below.
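As an illustration of what these inputs and outputs look like in code, the sketch below uses the Python port of CausalImpact (the tfcausalimpact package, imported as causalimpact), whose interface mirrors the R package used in this chapter. The data frame, column names, control series and dates are illustrative assumptions rather than the study's actual script.

```python
import pandas as pd
from causalimpact import CausalImpact  # Python port of the R CausalImpact package

# Hourly SO2 series for Sofia plus comparable control series (e.g. grid cells
# assumed to be unaffected by the policy); the first column is the target
data = pd.read_csv("sofia_air_quality_hourly.csv", index_col=0, parse_dates=True)
data = data[["so2_sofia", "so2_control_1", "so2_control_2"]]

# Pre-intervention period used to learn the counterfactual, and the
# post-intervention period over which the impact is evaluated
pre_period = ["2018-07-01", "2019-05-14"]
post_period = ["2019-05-15", "2021-07-31"]

impact = CausalImpact(data, pre_period, post_period)
print(impact.summary())            # average and cumulative effects
print(impact.summary("report"))    # plain-language interpretation
impact.plot()                      # original vs counterfactual, pointwise, cumulative
```

The summary output reports average and cumulative effects with credible intervals, in the same form as Table 7.2 below.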
5. FINDINGS

5.1 Date of Intervention
Based on the average duration of policy implementation suggested by earlier studies, the date of intervention for analysis was chosen to be nine months from the date on which the policies were proposed. The outcome of the CUMSUM change detection is shown in the graph in Figure 7.3. The trends show that there was a strong decreasing change that happened in May 2019. This change matched perfectly with the chosen air quality policy intervention period.
Figure 7.3
Detected change in SO2 levels for Sofia Municipality using CUMSUM detector

5.2 Policy Intervention Analysis on SO2 Level
The policy intervention impact is depicted in the graphs in Figure 7.4, which were generated using the CausalImpact package (Brodersen et al., 2015). The first graph shows the original curve of the target, Sofia City (solid line), along with the counterfactual curve (dashed line). The counterfactual values are obtained using a robust multivariate model. The higher the number of comparable time series, the greater is the robustness of the model. Ninety-five percent confidence intervals of the counterfactual value are also shown through the shaded region. The second graph shows the pointwise difference between the original value and the counterfactual value along with its 95 percent confidence interval.
Figure 7.4
Intervention impact analysis on SO2 levels of Sofia
The vertical dotted line represents the intervention period. Finally, in the third chart, the pointwise values after the intervention period in the second chart are cumulated and plotted. Depending on the nature of the time series evolution, the shape of this curve varies. If the cumulative curve stays continuously below the zero line, the intervention has caused a clear decrease in the evolution of the series; conversely, if it progresses continuously above the zero line, the intervention has caused a significant increase in the evolution of the time series under study. Such an increase or decrease is judged significant when the associated confidence intervals do not cross the zero line. The statistical significance of the policy intervention impact was validated using intervention analysis and is shown in Table 7.2. The results indicate that the policy significantly impacted SO2 emissions in Sofia. The findings are explained in the following paragraphs.
Table 7.2 Causal impact findings

Quantity | Average | Cumulative
Actual | 13.74 | 462854.4
Prediction (s.d.) | 8.71 (0.38) | 291017.6 (11.12)
95% CI | [8.64, 8.78] | [291015.42, 291019.779]
Absolute effect (s.d.) | −5.028 (0.38) | −171836.8 (11.12)
95% CI | [−4.95, −5.1] | [−171847.92, −171825.68]
Relative effect (s.d.) | −57.8% (−0.32%) | −57.8%
The results in the table belong to the post-intervention period of the dataset. We estimated the average of the counterfactual value of the air quality variable to be 13.74, while we found that the average of the actual trend in the post-intervention was 8.71 with a standard deviation of 0.38. We finally get the intervention effect by subtracting these averages. This intervention effect is the absolute effect with a value of −5.028 with a 95 percent confidence interval [−4.95, −5.1]. Similarly, the relative effect was −57.8 percent. The cumulative column of the
findings table represents the cumulative effect, obtained by summing the individual (hourly) air quality data points over about four years, the total period of observation. Finally, we tested the validity of these intervention effects and found that the posterior probability of a causal effect was 96.8 percent, meaning that the findings were statistically reliable. The arithmetic behind these quantities is sketched below.
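The quantities reported above can be reproduced from the observed and counterfactual series with simple arithmetic, as in the following sketch. The two placeholder series stand in for the model outputs; the relative effect is expressed against the counterfactual mean, following the CausalImpact convention.

```python
import numpy as np

rng = np.random.default_rng(2)

# Placeholder post-intervention series standing in for the model outputs:
# observed SO2 levels and the counterfactual predicted without the policy
observed = rng.normal(8.7, 0.4, 1000)
counterfactual = rng.normal(13.7, 0.4, 1000)

pointwise = observed - counterfactual              # second panel of Figure 7.4
absolute_effect = pointwise.mean()                 # average causal effect
relative_effect = absolute_effect / counterfactual.mean()
cumulative_effect = pointwise.sum()                # third panel of Figure 7.4

print(f"Absolute effect:   {absolute_effect:.2f}")
print(f"Relative effect:   {relative_effect:.1%}")
print(f"Cumulative effect: {cumulative_effect:.1f}")
```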
5.3 Change Detection
As part of the analysis, our objective was not just to detect the change and evaluate its impact but also to identify the type of impact model embodying the change triggered by the intervention. We therefore analyzed the slope and level change point estimates using the median. While the level change median value was 9.2, the slope change median value was 0.16, with a p-value < 0.05. This indicates that the change in level was more pronounced than the change in slope, that is, there was a sudden drop in air pollution as a result of the policy implementation (shown in Figure 7.5).
Figure 7.5
Type of change detection
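A complementary way to separate the two change types is a segmented regression with a level-shift indicator and a post-intervention slope term. The sketch below illustrates this on a synthetic series with a sudden drop and an unchanged slope; it is an illustration of the idea, not the estimation procedure behind the Bayesian point estimates reported above.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)

# Synthetic monthly series with a sudden drop (level change) at month 24
# and essentially no change in slope afterwards
t = np.arange(48)
post = (t >= 24).astype(float)               # 1 after the intervention
time_since = np.where(t >= 24, t - 24, 0)    # periods elapsed since intervention
y = 14 - 0.02 * t - 5 * post + rng.normal(0, 0.3, 48)

# Segmented regression: intercept, baseline trend, level change, slope change
X = sm.add_constant(np.column_stack([t, post, time_since]))
fit = sm.OLS(y, X).fit()

for name, coef, p in zip(
        ["intercept", "baseline slope", "level change", "slope change"],
        fit.params, fit.pvalues):
    print(f"{name:>15}: {coef:7.3f} (p = {p:.3f})")
```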
6. DISCUSSION

The detection of such impact models could help regulatory bodies implement policies and interventions optimally and modify them whenever necessary. Society-level policies of this kind should aim to bring about gradual changes in the habits and behaviors of citizens that contribute to deteriorating air quality. Policies should therefore be targeted more towards producing a significant slope change, because past studies have shown that it is often difficult for policies to maintain their effectiveness over longer periods when they generate a significant level change with negligible slope change (Turner et al., 2021). One of the most significant strengths of the ITS methodology applied in this chapter is its robustness to confounding variables that remain fairly constant over time, such as the socio-economic status of citizens. However, ITS can be fairly sensitive to variables that change rapidly over time, such as seasonality and climate. Hence the need to add Bayesian modeling to the ITS design, which makes the hybrid model more robust to such dynamic data features.
7. CONCLUSION

We applied a Bayesian extension to the statistically robust ITS design to detect and evaluate the effectiveness of the air quality policy implemented by the Sofia Municipality in Bulgaria. One of the core contributions of this methodological study is the reduced cost of experimental intervention analysis, since no control group is required. The proposed BITS model is hence suitable and applicable for similar policy impact analysis domains using big data-driven time series modeling in quasi-experimental settings. It offers significant improvements over conventional difference-in-difference methods by implicitly controlling for baseline trends and by handling the potential autocorrelation present in the data.
8. How predictive analytics can empower your decision making Nadia Nazir Awan
INTRODUCTION
Predictive analytics, one of the most advanced forms of big data analytics, has become popular in this decade. It rests on two essential principles: classification and regression. It predicts the likelihood of events and situations from past data and can handle both discontinuous and continuous change (Nyce, 2007). Predictive models combine data mining methods, which extract important information from large volumes of data, with up-to-date algorithms that analyse the data to uncover hidden patterns and generate predictions (Leventhal, 2010). Predictive analytics is concerned with estimating probabilities and trends: it determines the likely future outcome of an event or the likelihood that specific conditions hold in the present (Anonymous, 2020). Clustering techniques, decision analytics, neural networks, decision trees, genetic algorithms, regression modelling, text mining, hypothesis testing and other methods are used to analyse automatically large amounts of data containing many factors. The most basic element of predictive analytics is the predictor, which is commonly assessed for an individual or a whole organization to anticipate the future occurrences, hazards and opportunities hidden within the data (SAS, 2020).

Traditional business intelligence (BI) is a combination of tools, applications, infrastructure and best practices for aggregating data from many sources, preparing it for analysis, and then reporting on and analysing it to improve decisions and performance (Salleh, 2013). These systems are developed for analysts and are optimized to assist managerial choices that need aggregated views of data across the company, a unit or a department. Predictive analytics solutions transform unstructured data into useful knowledge. They let employees make decisions based on enormous amounts of data, which has a significant positive impact on equipment dependability and servicing. Employees can be more productive if they can recognize and diagnose equipment problems early, arrange necessary maintenance and avoid equipment failure. Predictive analytics tools can detect issues days, weeks or months in advance, allowing personnel to become proactive (DeAngelis, 2015). Corporations can use different predictive analytics tools to spend less time looking for potential issues and more time acting, thereby maximizing the value of each asset (Rajpurohit, 2014). They can find new revenue streams, improve profits and customer service, and improve operational efficiencies (Starns, 2020).

Predictive analytics employs a variety of modelling techniques, including statistics, AI and machine learning. The model is selected based on validation, testing and evaluation of the results. Each model has its own advantages and disadvantages and is best suited to certain types of problems. Predictive modelling is the process of running one or more algorithms on historical data to create a prediction (Siegel, 2013). The procedure entails training as well as testing the same data with different models before selecting the best
model fit. This chapter is important because it sets out the key process of predictive analytics and shows how it is beneficial across different industries. Second, applying these analytics can be useful for companies in the long run, from cost-cutting to achieving a competitive advantage over rivals (Hagiu and Wright, 2020).

In the current century, enterprises store massive quantities of diverse data from many sources. The business value of the insights contained in this Big Data is enormous. The greatest advantage comes from predictive analytics, which applies advanced analytical methodologies to forecast future events and influence decisions or actions. Across participants' responses on the most important Big Data analytics capabilities, predictive analytics emerged as the most important. Today's business leaders are looking for analytics tools that are customizable, tradeable and easy to implement (Stedman, 2017). Customers, business analysts and power users should all be able to use the analytics tools to satisfy their specific needs. Several market dynamics are driving predictive analytics, including the advent of Big Data, increased computing capacity, a deeper knowledge of the technology's utility and the emergence of particular economic pressures (Ventana Research, 2015). This chapter explores what predictive analytics has to offer and how firms from different industries can use these analytics across all operations to maximize their shareholders' equity and profitability. The contributions of several experts in this arena on the importance of predictive analytics and its applications are discussed. Last but not least, the chapter discusses the pitfalls to avoid when implementing these analytics.
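The modelling workflow described in this introduction – training and testing the same historical data with several candidate models before selecting the best fit – can be illustrated with a minimal, hypothetical sketch. The synthetic data, the candidate models and the accuracy criterion below are illustrative choices, not prescriptions from this chapter.

```python
# Minimal sketch of the train/validate/select workflow described above,
# using scikit-learn on a synthetic classification problem.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import accuracy_score

# Hypothetical historical data: each row is a past case, the label is the
# outcome we want to predict (e.g., churned / did not churn).
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "decision tree": DecisionTreeClassifier(max_depth=4, random_state=0),
}

# Compare candidate models on the training data via cross-validation ...
scores = {name: cross_val_score(model, X_train, y_train, cv=5).mean()
          for name, model in candidates.items()}
best_name = max(scores, key=scores.get)

# ... then confirm the chosen model on held-out test data.
best_model = candidates[best_name].fit(X_train, y_train)
print(best_name, "test accuracy:", accuracy_score(y_test, best_model.predict(X_test)))
```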
BENEFITS OF PREDICTIVE ANALYTICS
Predictive analytics provides the support to act proactively. In healthcare, it guides practitioners to detect high-risk patients in advance and act, rather than waiting until it is too late. This can be done using electronic health records, although it requires policymakers to shift away from traditional screening (Raghupathi and Raghupathi, 2014). Again, this change will require cost and time, but it will bring a long-term benefit for the community. Predictive analytics increases the chances of detecting diseases or conditions, reduces the cost of unnecessary tests and lab results, and allows medical practitioners to treat patients more efficiently (Ahuja, 2019). Variables such as the patient's age, family background, health history and location of residence play a key role in determining the expected outcome (Braveman and Gottlieb, 2014). Understanding the model's predictions is equally crucial. Alongside the right variables and outcomes, a suitable data set must also be available. For instance, patients with cancer and diabetes need to be treated ahead of those with a runny nose; treating high-risk patients first helps to maximize the wellness of everyone. However, building this kind of model requires investment in technology (Davenport, 2006); there is a cost involved in gathering data and then providing a platform to store large data sets. Health ministries can reap the benefits of predictive analytics only once they have invested in technology and created a system to support their future research (Anonymous, 2020).

One of the chief advantages of predictive analytics, and one that makes it attractive to financial institutions, is its use in detecting fraudulent transactions. Banks, insurance companies and credit scoring organizations use it to minimize risk. Securities exchanges, insurance companies and investment companies all rely heavily on predictive analytics. The major use is to check the creditworthiness of borrowers, that is, whether they will be able
to repay on time. Government organizations use it to detect all kinds of fraudulent activity and protect consumers. Financial institutions also calculate the risk profile of customers and establish which customers will be most profitable to them. There are several examples of banks and institutions using predictive analytics (OECD, 2020). PayPal, for example, uses it to stop fraudulent activity when a payment is processed, and insurance companies use it to provide quick services when managing claims (Burns, 2016).

In education, predictive analytics can help ensure that assessments, pop-up quizzes, reading materials and, most importantly, entire course syllabi are of a consistent standard. The related term learning analytics describes analytics that have helped students as well as teachers reach their goals. Several measures can be taken to upgrade the quality of education by identifying unnecessary information and reading material and focusing on the required curriculum (Martin and Ndoye, 2016). This not only improves students' grades but also helps them excel in their careers. Predictive learning analytics is likewise used to understand which courses will attract the largest student enrolments and which subjects will help students showcase their skills in industry. It also allows lecturers to identify low-performing students who may be on the brink of failing a course and to develop ways to make subjects easier for them. This is also increasingly significant for students, helping them understand their academic standing in a certain course or subject (Wells, 2016).

Government in general is also benefiting from predictive analytics, for example in planning to ensure a smooth public transport experience. Logistics companies such as FedEx, UPS and DHL have adopted analytics to reduce expenses and increase operating efficiency. For freight businesses, predictive analytics helps ensure that ships remain in good repair (PwC, n.d.). Right Ship, a cargo company located in Melbourne, uses predictive analytics to better evaluate whether ships are ready to travel. Other government applications of analytics include traffic control, route planning, intelligent transportation systems and congestion management (Richter et al., 2020). Analytics tools have been used to handle demand forecasting, integrated business planning, supplier collaboration, risk analysis and inventory control. Analytics can assist businesses in meeting the increased expectations of customers who want items to arrive when promised (Davenport et al., 2011). Supply chain professionals need a comprehensive perspective on future inventory events. Self-driving cars and robotics both employ data analytics (Qi, 2019).

Predictive analytics helps you to understand what your customers want, how you can please them, which offers attract more customers, and how sales can be increased by applying customer-focused strategies (Freedman, 2021). Examining the patterns of consumers' purchases and analysing data to increase future sales is becoming increasingly common in the retail sector, and predictive analytics is widely used in retail and marketing (Nicasio, 2021). Data analytics is used in the retail industry for fraud prevention, real-time inventory analysis and personnel optimization, among other things.
In addition, data analytics is being used by the sector to determine client responses and purchases, as well as to build upselling and cross-selling prospects (Badole, 2021). Target, for example, uses predictive analytics to determine which consumers are most likely to become pregnant soon based on their purchasing histories; the information is then used to approach potential consumers with offers targeted to new parents' needs. Tesco, a large supermarket chain based in the United Kingdom, operates in 13 countries and uses predictive analytics to increase coupon redemption rates at its grocery cash registers, distributing 100 million customized coupons each year (Attaran and Attaran, 2018).
As computers become smaller and more ubiquitous (for example, wearables and the Internet of Things (IoT)) and the number of applications rises, both system-initiated and consumer task completion across various applications and web services are expected to become essential to managing one's personal life and increasing productivity at work. Sarikaya (2017) provides an overview of personal digital assistants (PDAs), detailing the technology, system architecture and key components that underlie them, and discusses how they could ultimately change how humans and computers interact. Singles can use predictive analytics to meet other singles they might not have met otherwise. Netflix employs predictive analytics to decide which movies a member will appreciate based on what he or she has already seen, and Facebook uses analytics to increase the accuracy of recommending people you may know and want to connect to. Google is a big user of predictive analytics and uses it to change the way people search: Google Instant, Google Suggest and Google Autocomplete are all based on predictive analytics. Google has improved predictive search to act as a personalized assistant that can anticipate users' requirements, wants and deeper desires (Sarikaya, 2017).

The term 'HR analytics' was coined to describe the measurement of the effectiveness of HR programmes. Other terms that have lately gained popularity include people analytics, talent analytics and workforce analytics (Marler and Boudreau, 2017). The term 'people analytics' has largely replaced 'business analytics'. People analytics analyses HR data using mathematics, statistics and predictive modelling to find trends and make better judgements about all aspects of HR strategy in order to improve organizational performance (Tursunbayeva et al., 2018). People analytics is becoming more popular as a tool for recruiting, pay negotiations, performance management and employee retention. It has been used to estimate which employees are most likely to leave, to evaluate employee performance and to determine an employee's bonus (Anderson et al., 2022). For instance, Hewlett-Packard has created a predictive analytics tool to estimate the likelihood of employees quitting, while US special forces utilize predictive analytics to determine whether candidates will be productive and deserving of years of training. LinkedIn uses predictive analytics to label people's profiles with skills it believes they possess based on their written content (Austin and Pisano, 2017).

Predictive analytics is also used to detect flaws to assure safety and efficiency. In New York City, Con Edison estimates energy distribution cable failure and updates risk ratings three times per hour. Nokia Siemens Networks increases service availability by accurately forecasting client usage on its 4G wireless network (Kandhammal and Duraisamy, 2018). BNSF employs predictive analytics to anticipate faulty railway rails, and predictive modelling of nuclear reactor failures is done at Argonne National Laboratory (Prokofiev et al., 2016). In this quickly evolving industry, policymakers must strike a balance between regulation and innovation.
A more formal approach to validating machine learning and AI, similar to the field of predictive biomarkers, can realize the promise of predictive analytics while protecting patients, moving from incredible predictive potential to improved patient outcomes (Pesapane et al., 2018). Officials at the University of Maryland use predictive analytics to analyse grades, demographics, financial aid, course schedules and enrolment status to identify at-risk students and improve retention rates (Attaran, Stark and Stotler, 2018). They believe that predictive analytics enables officials to intervene with struggling students before their situation deteriorates. Analytics can assist officials in identifying bottlenecks and problems, such as a difficult class or other pressing issues that may cause students to drop out. Predictive analytics also offers telecom firms a strategy for targeting different-value clients with suitable offers and services. Classification algorithms
were applied with the attachment levels as categories and the chosen attributes as inputs, the results were compared, and the best classification model in terms of accuracy was chosen. Loyalty prediction rules were then built using this model; these described the relationship between the behavioural variables and the categorization groups, as well as the reasons for loyalty in each location, and appropriate offers and services were made available to target users. Another advantage of using classification algorithms was that they enabled an accurate predictive model to be created for categorizing new customers according to their loyalty (Shah et al., 2018). Because of their advantages as an alternative forecasting tool, decision-tree (tree-based) models have grown in popularity in different disciplines (e.g. psychology, ecology and biology) in recent years (Sinha, Quan, McDonald and Valdez, 2016). This literature also highlights the importance of building predictive models that account for dependencies: in actuarial science, insurance and finance, we have witnessed in recent years the importance of capturing the dependency structure of a multivariate response when developing tools for prediction (Fominaya, 2016). Members of the industry are eager to reap the benefits of predictive analytics, and they intend to employ this management tool more in future years. Some expect to do so due to regulatory requirements, while others aim to grow their capabilities. Many of the organizations already utilizing predictive analytics intend to expand their use by leveraging existing capabilities and personnel – in other words, doing more with fewer resources.

Although healthcare has lagged behind other industries in adopting technology, it has recently gained traction in areas such as the widespread adoption of electronic health records (EHRs) and IBM's Watson as an artificial intelligence decision assistance system for physicians. Although both require refining and wider use, the foundation for adopting predictive modelling as a tool is the same. Since 1988, the Freedom of Information and Protection of Privacy Act has been in place in British Columbia; on 1 January 2012 it was expanded to cover hospitals. In November 2004, Ontario passed the Personal Health Information Protection Act (PHIPA). The Health Information Protection Act, also enacted in 2004, comprises two schedules: the PHIPA (Schedule A) and the Quality of Care Information Protection Act (Lader et al., 2004). The PHIPA establishes guidelines for the collection, use and sharing of personal health data. It is therefore crucial to know ahead of time whether data can be incorporated into a predictive model, given regulatory and compliance constraints. As electronic medical records become more widely adopted, the large amount of accessible data brings a higher risk of diagnostic and treatment errors simply because so much data must be handled. Watson and other artificial intelligence applications in medicine may assist physicians in navigating a complex set of clinical conditions, laboratory findings and imaging outcomes to arrive at a set of potential clinical diagnoses and possible treatments that may improve patient outcomes and lower healthcare costs (Ferrucci et al., 2013). Predictive models and technology do not replace clinicians.
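Returning to the telecom loyalty example above, the following minimal sketch illustrates the general idea of training a tree-based classifier on behavioural attributes, checking its accuracy and reading off loyalty rules. All feature names, the loyalty definition and the data are hypothetical, not drawn from any cited study.

```python
# Minimal sketch of classification-based loyalty modelling: train a
# tree-based classifier on behavioural attributes, check its accuracy,
# and read off human-readable prediction rules.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)
n = 2000
data = pd.DataFrame({
    "monthly_spend": rng.gamma(4, 20, n),
    "calls_per_week": rng.poisson(15, n),
    "complaints": rng.poisson(1, n),
    "tenure_months": rng.integers(1, 120, n),
})
# Hypothetical loyalty label: long-tenured, low-complaint customers count as loyal.
loyal = ((data["tenure_months"] > 24) & (data["complaints"] < 2)).astype(int)

X_train, X_test, y_train, y_test = train_test_split(data, loyal, test_size=0.3, random_state=1)
tree = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_train, y_train)

print("accuracy:", accuracy_score(y_test, tree.predict(X_test)))
print(export_text(tree, feature_names=list(data.columns)))  # readable loyalty rules
```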
CHALLENGES
Having the right technologies to evaluate and store data is not enough to run an effective predictive analytics programme. How data is gathered, split, integrated, analysed and used is crucial, and the information must be current and timely. A successful strategy also requires hiring people to convert the data and findings into useful information, all of which comes at a cost. For smaller businesses, these costs can be significant. This human aspect includes not just resources, time and skill, but also cultural acceptance, since some executives still 'trust their gut' on some decisions rather than embracing predictive analytics outcomes. In this respect, many industry members believe not only that changing company culture will be difficult but that many organizations are not yet ready to start that change. Although equipment leasing and finance companies today are being deluged with data, the predictive analytics component of business intelligence remains a large challenge. Second, the gathering and storage of data is one of the most difficult aspects of predictive analytics with big data. The obstacles that come with this include researching the data and sharing it with the appropriate people (Bari, Chaouchi and Jung, 2016). There are further challenges in organizing and integrating data after extraction, and in reducing risk by eliminating clinical decision-making errors and providing other medical assistance. These could be addressed through progress in digital therapy approaches, such as the use of mobile applications and smart sensors, among other things. Big data and predictive analytics are extremely beneficial in healthcare applications, such as calculating chronic illness risk scores, avoiding patient deterioration and preventing patient self-harm (Ding et al., 2021).

Research in this area also has several limitations that must be addressed. When evaluating results, it is important to keep in mind sample sizes as well as the use of self-report data from a Learning Analytics (LA)-aware demographic. Moreover, despite some studies' global scope, no replies from Asian, Middle Eastern or African countries were recorded. As a result, future research should focus on providing more empirical information about educational institutions' ability to use LA frameworks. National surveys may aid in the identification of shortcomings in higher education systems. More importantly, thorough empirical research is needed to assess the usefulness of LA frameworks for improving learning and teaching.

Big data and predictive analytics have great promise for supporting better, more efficient treatment, and there have been notable recent advances, especially in image analytics. However, because prediction has the power to affect decision-making, it also has the potential to do harm by disseminating false information at the point of care (Raghupathi and Raghupathi, 2014). In a profit-driven market, the risk of harm from inadequately verified models necessitates foresight. New medical technologies, such as drugs and devices, are scrutinized by the public, and physicians must be licensed and board certified. An independent agency that validates prediction models and approaches before allowing them to be used in clinical practice could help address some of these issues and ensure that predictive analytics delivers better value and outcomes for individual patients and the healthcare system as a whole.
Clinicians play a critical part in the patient care process: diagnosing, establishing treatment protocols and giving follow-up care. When statisticians and experienced physicians work together to guarantee that the models can appropriately serve the doctors for whom they are designed, especially in complex areas like health, well-designed prediction models result (Giga, 2017). Health Canada, the MOHLTC and LHINs are the decision-makers in terms of what is funded, and they must agree on the overall
benefit of employing the technology. According to research, countries are at varying phases of implementing EHRs, and while obstacles remain in implementing health information exchange methods, healthcare overall is growing in its use of technology (Payne et al., 2019).

There is a greater tolerance for error in industries like sales and marketing. The better the marketing results, the more data is easily accessible and can be absorbed into the analytics process. This might mean 'as much as 85 per cent accuracy in predicting who will buy'. Although there is a 15 per cent rate of inaccuracy, an average accuracy of 85 per cent might make a difference of thousands of dollars. Because customer behaviour is becoming increasingly important in the business world, telecom operators are focusing not only on customer profitability to expand market share, but also on highly loyal consumers and churning customers. Big data principles ushered in a new generation of Customer Relationship Management (CRM) strategies (Martens et al., 2016). Big data analysis can be used to explain customer behaviour, analyse routines, create appropriate marketing programmes, uncover sales transactions and build long-term loyalty relationships.

In healthcare we are dealing with people's lives, so the situation is more challenging. The acceptable overall accuracy must be established, as well as the thresholds that surround it. Ethical obligations and moral issues must be addressed when calculating acceptable accuracy rates, and criteria must be defined for rejecting or approving performance. Finally, it must be established who is responsible for a computer program's recommendations: should the system's developer, the clinical experts involved, or the treating physician be held accountable if a mistake occurs? Clinicians, hospitals and policymakers would need to feel confident in depending on such a system as an additional tool to aid the delivery of treatment, just as they did when ultrasound and magnetic resonance imaging were first introduced as technological improvements. According to one study carried out by Johns Hopkins University researchers, diagnostic errors cause around 40 500 deaths in intensive care units in the United States each year; in 65 per cent of cases, system-related problems, poor processes, and failures of cooperation and communication were involved (Dilsizian and Siegel, 2014). Diagnostic errors like these add considerably to rising healthcare costs, with each malpractice claim for a misdiagnosis caused by cognitive error costing an estimated $300 000.

Aside from the expenditure of acquiring the data, properly using massive data sets requires a system with adequate storage capacity, the ability to connect across platforms and the use of current technologies. Management must be willing to adopt new approaches, which may require a large commitment of time and money as well as a convergence of economic advantages. As discussed earlier, one of the strengths of predictive analytics is that it can find relationships between data points, or predictors/indicators of possible losses. Nevertheless, it is hard for an insurer to justify charging clients different premium rates based on the results of a predictive model. Insurers' use of an insurance score or credit score as a policy price determinant has been criticized by several consumer groups and authorities.
Initially, it was difficult for insurers to justify using this relationship to create pricing policies because they could not explain why there was a correlation between credit scores and loss ratios (Nyce, 2007). The cost of intervention research is also very high, and such studies typically involve very small samples followed over short periods. Therefore, public–private partnerships that bring in many groups would be beneficial over the long term and very useful for future researchers in this area (Hahn et al., 2017).
CONCLUSION AND RECOMMENDATIONS
Globally, corporations are working in a more competitive environment than ever before. The focus of this chapter has been to identify best practices of predictive analytics in different sectors. These advantages will be of most use if the obstacles that businesses currently face are known, so that predictive analytics can be used wisely to deal with them. Predictive analytics using big data is extremely valuable across sectors and avenues, and this chapter has illustrated its use in a range of industries: healthcare, retail, banking, financial services, insurance, government, utilities, logistics and energy, to mention a few.

Businesses should regard analytics as a strategic investment, because adding new analytic capabilities influences existing applications, devices, services and websites. Important factors need to be considered, such as how best to deliver analytic information and reports to consumers and employees, and how to ensure that analytic content is engaging. Corporate leaders must consider how a predictive analytics project will affect users (both external and internal), current interfaces and applications, and executives, managers, marketers and salespeople. The two most difficult aspects of predictive analytics with big data are data collection and storage; researching the data and disseminating it to the relevant individuals are among the challenges. There are also issues in processing and integrating data after it has been extracted, and in reducing risk by reducing clinical decision-making errors and providing additional medical help. With the announcement of the 2017 Canadian budget, which includes a $125 million commitment from the federal government to build an artificial intelligence plan to boost productivity and innovation, now is a good time to evaluate the many advantages of predictive analytics. Cost analysis studies must be explored as health practitioners and decision-makers seek better and more inventive approaches to managing cost and resourcing needs while delivering high-value patient outcomes. Such costs include the funds needed to purchase or build the IT platform, human resources (professional physicians, statisticians, developers), staff training, potential liabilities, and so on.
FUTURE IMPLICATIONS
In the future, it will be important to perform cost–benefit analyses to assess whether the resources required to develop a model, including the physician specialists needed to inform it, are feasible. Early diagnosis and management should lower expenses, even when the returns are based on assumptions. The return on investment of prevention has always been problematic because there is no guarantee of a bad outcome, and even if prevention is effective, one cannot be confident that the prevention efforts were the cause. In the coming years, predictive analytics will replace traditional methods with automated forecasting models for treating different issues, further improving decision-making. Electronic records will sweep away old-school methods of preserving records and will change the way things look in today's world (Bonk, 2016). Nearly all decisions will be based on information technology. This will also create opportunities for scholars, as a gigantic amount of data will be available for further research.
REFERENCES Ahuja, A.S. (2019). The impact of artificial intelligence in medicine on the future role of the physician. National Library of Medicine. Retrieved from https://www.ncbi.nlm.nih.gov/pmc/articles/ PMC6779111. Anderson, D., Bjarnadóttir, M.V. and Ross, D.G. (2022). Using people analytics to build an equitable workplace. Harvard Business Review, 27 January. Anonymous (2020). Predictive analytics in healthcare: Three real-world examples. (2020). Philips. Retrieved from https://www.philips.com/a-w/about/news/archive/features/20200604-predictive -analytics-in-healthcare-three-real-world-examples.html. Attaran, M. and Attaran, S. (2018). Opportunities and challenges of implementing predictive analytics for competitive advantage. International Journal of Business Intelligence Research, 9(2), 1–26. Retrieved from https://www.researchgate.net/publication/326332872_Opportunities_and_Challenges _of_Implementing_Predictive_Analytics_for_Competitive_Advantage. Attaran, M., & Attaran, S. (2019). Opportunities and challenges of implementing predictive analytics for competitive advantage. Applying Business Intelligence Initiatives in Healthcare and Organizational Settings, 64-90. Attaran, M., Stark, J., & Stotler, D. (2018). Opportunities and challenges for big data analytics in US higher education: A conceptual model for implementation. Industry and Higher Education, 32(3), 169–182. Austin, R.D. and Pisano, G.P. (2017). Neurodiversity as a competitive advantage. Harvard Business Review, 1 May. Badole, M. (2021). Data science use cases in retail industry. Analytics Vidhya. Retrieved from https:// www.analyticsvidhya.com/blog/2021/05/data-science-use-cases-in-retail-industry. Bari, A., Chaouchi, M., & Jung, T. (2016). Predictive analytics for dummies. John Wiley & Sons. Bradlow, E.T., Gangwar, M., Kopalle, P. and Voleti, S. (2017). The role of big data and predictive analytics in retailing. Journal of Retailing, 93(1), 79–95. Bonk, C. (2016). Keynote: What is the state of e-learning? Reflections on 30 ways learning is changing. Journal of Open, Flexible, and Distance Learning, 20(2), 6-20. Braveman, P. and Gottlieb, L. (2014). The social determinants of health: It’s time to consider the causes of the causes. National Library of Medicine, 129. Retrieved from https://www.ncbi.nlm.nih.gov/pmc/ articles/PMC3863696/. Burns, E. (2016). How PayPal fights fraud with predictive data analysis. Retrieved from https:// www.predictiveanalyticsworld.com/machinelearningtimes/paypal-fights-fraud-with-predictive-data -analysis/7315/. Davenport, T.H. (2006). Competing on analytics. Harvard Business Review. Retrieved from https://hbr .org/2006/01/competing-on-analytics. Davenport, T. (2022). A predictive analytics primer. Harvard Business Review. Accessed 26 February 2022 at: https://hbr.org/2014/09/a-predictive-analytics-primer. Davenport, T.H., DalleMule, L. and Lucker, J. (2011). Know what your customers want before they do. Harvard Business Review. Retrieved from https://hbr.org/2011/12/know-what-your-customers-want -before-they-do. DeAngelis, S.F. (2015). Predictive analytics becoming a mainstream business tool. Enterra Solutions, 30 April. Dilsizian, S. E., & Siegel, E. L. (2014). Artificial intelligence in medicine and cardiac imaging: harnessing big data and advanced computing to provide personalized medical diagnosis and treatment. Current cardiology reports, 16, 1-8. Ding, J., Minhas, U. F., Chandramouli, B., Wang, C., Li, Y., Li, Y., ... & Kraska, T. (2021, June). 
Instance-optimized data layouts for cloud analytics workloads. In Proceedings of the 2021 International Conference on Management of Data (pp. 418-431). Fominaya, C.E. (2016). PHP154 – Using predictive analytics to audit pharmacy benefit decisions within the VA system. Value in Health, 19(3), A284–A284. Freedman, M. (2021). How businesses are collecting data (and what they’re doing with it). Business News Daily. Retrieved from https://www.businessnewsdaily.com/10625-businesses-collecting-data .html.
Germann, F., Lilien, G.L., Fiedler, L. and Kraus, M. (2014). Do retailers benefit from deploying customer analytics? Journal of Retailing, 90(4), 587–93.
Giga, A. (2017). How health leaders can benefit from predictive analytics. Healthcare Management Forum, 30(6), 274–7.
Hagiu, A. and Wright, J. (2020). When data creates competitive advantage and when it doesn't. Harvard Business Review. Retrieved from https://hbr.org/2020/01/when-data-creates-competitive-advantage.
Hahn, T., Nierenberg, A.A. and Whitfield-Gabrieli, S. (2017). Predictive analytics in mental health: Applications, guidelines, challenges and perspectives. Molecular Psychiatry, 22(1), 37–43.
HR Analytics: Here to Stay or Short-Lived Management Fashion? https://files.eric.ed.gov/fulltext/EJ1110545.pdf.
Ifenthaler, D. (2017). Are higher education institutions prepared for learning analytics? TechTrends, 61(4), 366–71.
Kandhammal, K. and Duraisamy, S. (2018). Review on big data challenges for 4G revolutions. International Journal of Advanced Research in Computer Science, 9(5).
Khoury, M.J., Engelgau, M., Chambers, D.A. and Mensah, G.A. (2018). Beyond public health genomics: Can big data and predictive analytics deliver precision public health? Public Health Genomics, 21(5/6), 244–9. Retrieved from https://doi.org/10.1159/000501465.
Lader, E.W., Cannon, C.P., Ohman, E.M., Newby, L.K., Sulmasy, D.P., Barst, R.J., ... and Costa, F. (2004). The clinician as investigator: participating in clinical trials in the practice setting. Circulation, 109(21), 2672–2679.
Leventhal, B. (2010). An introduction to data mining and other techniques for advanced analytics. Journal of Direct, Data and Digital Marketing Practice, 12(2), 137–53. Retrieved from https://link.springer.com/article/10.1057/dddmp.2010.35#citeas.
Marler, J.H. and Boudreau, J.W. (2017). An evidence-based review of HR Analytics. The International Journal of Human Resource Management, 28(1), 3–26.
Martens, D., Provost, F., Clark, J. and de Fortuny, E.J. (2016). Mining massive fine-grained behavior data to improve predictive analytics. MIS Quarterly, 40(4), 869–888.
Martin, F. and Ndoye, A. (2016). Using learning analytics to assess student learning in online courses. Journal of University Teaching & Learning Practice, 13(3), 7.
Munkevik Kenzler, E. and Rask-Andersen, V. (2020). Exploring the impact of digitalization on strategy development: A study of the healthcare and financial sector in Sweden. Retrieved from https://hdl.handle.net/20.500.12380/300678.
N/A (2020). Descriptive, predictive & prescriptive analytics: What are the differences? Retrieved from UNSW Online: https://studyonline.unsw.edu.au/blog/descriptive-predictive-prescriptive-analytics.
Nicasio, F. (2021). Retail analytics: How to use data to win more sales and customers. Vend. Retrieved from https://www.vendhq.com/blog/how-retailers-can-use-data-to-boost-productivity-customer-service-sales/.
Nyce, C. (2007). Predictive analytics white paper. The Digital Insurer. Retrieved from https://www.the-digital-insurer.com/wp-content/uploads/2013/12/78-Predictive-Modeling-White-Paper.pdf.
OECD (2020). Personal data use in financial services and the role of financial education.
A consumer-centric analysis. Retrieved from https://www.oecd.org/finance/Personal-Data-Use-in-Financial-Services-and -the-Role-of-Financial-Education.pdf. Parikh, R., Obermeyer, Z. and Navathe, A. (2019). Regulation of predictive analytics in medicine: Algorithms must meet regulatory standards of clinical benefit. Science (American Association for the Advancement of Science), 363(6429), 810–12. Payne, T. H., Lovis, C., Gutteridge, C., Pagliari, C., Natarajan, S., Yong, C., & Zhao, L. P. (2019). Status of health information exchange: a comparison of six countries. Journal of global health, 9(2). Pesapane, F., Codari, M., & Sardanelli, F. (2018). Artificial intelligence in medical imaging: threat or opportunity? Radiologists again at the forefront of innovation in medicine. European radiology experimental, 2, 1–10.
Prokofiev, I., Bakhtiari, S. and Hull, A. (2016). Prospective techniques for monitoring degradation in passive systems, structures, and components for nuclear power plant long-term operation. Paper presented at CORROSION 2016, 6–10 March, Vancouver, Canada.
PwC (n.d.). Shifting patterns: The future of the logistics industry. Retrieved from https://www.pwc.com/gx/en/transportation-logistics/pdf/the-future-of-the-logistics-industry.pdf.
Qi, F. (2019). The data science behind self-driving cars. Medium.com. Retrieved from https://medium.com/@feiqi9047/the-data-science-behind-self-driving-cars-eb7d0579c80b.
Raghupathi, W. and Raghupathi, V. (2014). Big data analytics in healthcare: Promise and potential. National Library of Medicine. Retrieved from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4341817.
Rajpurohit, A. (2014). Has predictive analytics crossed the chasm? Ventana Research.
Richter, A., Lowner, M.O., Ebendt, R. and Scholz, M. (2020). Towards an integrated urban development considering novel intelligent transportation systems: Urban development considering novel transport. Technological Forecasting and Social Change, 155. Retrieved from https://www.sciencedirect.com/science/article/pii/S0040162518319498.
Salleh, R.S. (2019). Retail analytics. In B. Pochiraju and S. Seshadri (eds), Essentials of Business Analytics (pp. 599–621). Cham: Springer.
Salleh, S. (2013). Applying predictive analytics in enterprise decision making. Retrieved from https://analytica.com/applying-predictive-analytics-in-enterprise-decision-making.
Sarikaya, R. (2017). The Technology Behind Personal Digital Assistants: An overview of the system architecture and key components. IEEE Signal Processing Magazine, 34(1), 67–81.
SAS (2020). Predictive analytics: What it is and why it matters. Retrieved from https://www.sas.com/en_au/insights/analytics/predictive-analytics.html.
Shah, N.D., Steyerberg, E.W. and Kent, D.M. (2018). Big data and predictive analytics: Recalibrating expectations. Journal of the American Medical Association, 320(1), 27–8.
Siegel, E. (2013). Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie, or Die. John Wiley & Sons.
Sinha, M., Quan, D., McDonald, F.W. and Valdez, A. (2016). Cost minimization analysis of different strategies of management of clinically significant scorpion envenomation among paediatric patients. Pediatric Emergency Care, 32(12), 856–862.
Starns, V.A. (2020). Exploring futuring and predictive analytics for developing organizational strategy. International Journal of Business Strategy and Automation, 1(4), 1–9.
Stedman, C. (2017). Eyeing the future with predictive analytics can pay dividends now. Retrieved from http://searchbusinessanalytics.techtarget.com/ehandbook/Predictive-data-analytics-advances-businesses-ahead-of-the-game.
Tursunbayeva, A., Di Lauro, S. and Pagliaris, C. (2018). People analytics: A scoping review of conceptual boundaries and value propositions. International Journal of Information Management, 43, 224–47.
Tocci, G., Nati, G., Cricelli, C., Parretti, D., Lapi, F., Ferrucci, A., ... and Volpe, M. (2017). Prevalence and control of hypertension in the general practice in Italy: updated analysis of a large database. Journal of Human Hypertension, 31(4), 258–262.
Ventana Research (2015). Next-generation predictive analytics. White paper, June. Retrieved from http://www.iconresources.com/Icon/eMailer/IBM/Ventana-Research/Ventana-Research-on-Next-generation-Analytics-june-2015.pdf.
Wassouf, W.N., Alkhatib, R., Salloum, K. and Balloul, S. (2020). Predictive analytics using big data for increased customer loyalty: Syriatel Telecom Company case study. Journal of Big Data, 7(1), 1–24. Wells, C. (2016). Maryland universities to use data to predict student success – or failure. The Baltimore Sun, 11.
9. Gaussian process classification for psychophysical detection tasks in multiple populations (wide big data) using transfer learning Hossana Twinomurinzi and Herman C. Myburgh
1. INTRODUCTION
The only way to understand an unseen phenomenon is through its physical manifestation and the corresponding subjective response of the senses; this is known as psychophysics [71]. Psychophysics is therefore central to psychology [18] and to other physical sciences such as medical diagnostics, neuroscience, engineering and architecture, and to almost everything that humans and animals do [71]. Psychophysics began with early attempts to measure the relationship between mind and matter [61, 71, 27], specifically the relationship between energy in the environment and the response of the senses to that energy [36], focusing on the response of the observer to carefully controlled and systematically varied physical stimuli [36]. The concept of threshold (described in subsection 1.1) is therefore central to psychophysics. The fundamental law of psychophysics, the Weber–Fechner law, states that the subjective sense of intensity is related to the physical intensity of a stimulus by a logarithmic function [36]. Psychometric functions (PFs) are the mathematical instruments used to describe this behaviour in psychophysical detection tasks (PDTs) [27]. The shape of PFs is similar across the different psychophysical tasks: sight, sound, touch, taste, smell and the sense of time.

In this chapter, for the sake of brevity, we focus on the psychophysical task of sound, notably on the ability to measure hearing, and on recent developments in the use of a specialised Machine Learning (ML) method to improve its assessment in what is now regarded as computational audiology (CA) [64]. In particular, we consider the current gap whereby ML-driven CA is limited to single populations (with large or small datasets), even though the opportunity exists to extend its capabilities across multiple populations (wide Big Data) [12] to speed up and reduce the cost of PDT diagnostics [64]. Specifically, we seek to answer the question: how can Gaussian Process Classification (GPC) be extended to multiple populations (wide big data) to speed up and reduce costly PDT diagnostics?

The remainder of the chapter is structured as follows: the next section describes the psychometric function and audiometry. It is followed by a description of machine learning, with an emphasis on Gaussian Process Classification (GPC). The final section presents the systematic review of GPC used for audiogram estimation and offers insights for further research.
1.1 The Psychometric Function and Audiometry
Audiometry is the study of the ability to hear, whether through air or bone conduction [17, 71], with the audiometer as the traditional device used to assess hearing. The output from an audiometer is the audiogram, whose central measure is the threshold. The threshold is the lowest intensity level, or the faintest sound level, that elicits a response from a subject, assuming a pure tone. The audiogram, usually in the form of a graph, remains the traditional means of recording the hearing threshold at each sound frequency. The numerical audiogram is not visually represented and is preferred in computational methods that estimate the threshold (audiogram estimation) [17]. On the audiogram, the frequency scale is on the x-axis and is represented logarithmically, usually from 500Hz to 8000Hz, since subjective intensity increases in proportion to the logarithm of physical stimulus energy [36]. The y-axis represents the intensity in decibels, using either the Sound Pressure Level (dB SPL) or the Hearing Level (dB HL).

In audiometry, the interest is not in the measured performance at a given sound intensity but in the sensitivity of hearing; that is, the interest is in detecting even slight changes in hearing [27]. This is why the interest is in the PF slope and the threshold value. The α parameter represents the threshold value and is the point on the abscissa corresponding to 50 per cent performance. The β parameter indicates the slope or gradient of the curve [27]. There are four other precision and deviance values that are important in the PF: the standard error (SE) of the threshold, which measures the distance of the estimated threshold from the true value of α; the SE of the slope, which measures the distance from the true value of β; the statistical goodness of fit, which determines the extent to which the fitted PF is a true representation of the data (assuming a maximum likelihood method); and the deviance with its corresponding p-value [27]. Kingdom et al. [27] identify five steps in estimating the audiogram threshold through measuring and fitting the PF, discussed next.

Step 1: Choosing the stimulus level
The first step is choosing the stimulus level. There are three choices to make:
1. the number of trials to be performed;
2. the range of stimulus levels;
3. the choice between linear or logarithmic spacing of the stimulus levels (see the short snippet below).

The more trials there are, the more accurate the estimates of the fitted parameters. If the interest is simply to show that thresholds differ, fewer trials are sufficient than when the interest is in detecting slight differences. Curve-fitting procedures might not converge on a fit if there is insufficient data; 400 trials is a general rule of thumb if the aim is to estimate both the threshold and the slope. For a performance-based task, the stimulus range should run from the just noticeable difference (JND) to just under 100 per cent correct. If more than one sound level produces approximately chance performance, the lower end of the stimulus range needs to be shifted to a higher level – the same applies to levels that all produce approximately 100 per cent correct performance. There is no need to use many finely spaced stimuli; it is better to concentrate on responses to just a few appropriately distributed stimulus levels for reliable estimates.
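A minimal illustration of the spacing choice in point 3 above; the endpoints and number of levels are arbitrary examples rather than recommendations from this chapter.

```python
# Linear versus logarithmic spacing of stimulus levels (illustrative values).
import numpy as np

n_levels = 7
linear = np.linspace(5, 80, n_levels)        # equal absolute steps
logarithmic = np.geomspace(5, 80, n_levels)  # equal ratios: finer near the low end
print(np.round(linear, 1))
print(np.round(logarithmic, 1))
```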
Logarithmic spacing has some advantages over linear spacing, as it allows for relatively small intervals at the low end of the range and relatively large intervals at the high end. In the auditory domain, the transducer functions (the relationship between the physical and internal representation of a dimension) are generally decelerating: constant increases in stimulus intensity lead to smaller and smaller increases in internal intensity.

Step 2: Types and choice of function
Different theoretical approaches (frequentist versus Bayesian) and their associated functions lead to different PFs [31]. For example, the a priori frequentist approaches work on the premise that the stimulus intensity must exceed a certain threshold for it to be noticed, while the a posteriori Bayesian approaches propose using PFs that most easily and accurately fit the data. Both are briefly discussed in Step 5 (see below).

Step 3: Fitting the function
Once the choice of theory and function has been made, the next step is to find (and optimise) the function that describes the data best. The most commonly used a priori criterion is the Maximum Likelihood Estimate (MLE); for a posteriori approaches, Bayesian or Maximum A Posteriori (MAP) estimation is used. The underlying principle of the MLE method is that statistical inference is best achieved by finding the most optimal way to fit a distribution to the data. A good fitting procedure should accurately determine whether a unique maximum likelihood solution exists; if it does, it should be able to find it and avoid local minima/maxima. Bayesian estimation, on the other hand, rests on the premise that the parameters that govern the model can be discovered incrementally from the data, based on known prior information and the data [49]. The primary audiometric assessment function follows a 2-alternative forced choice (2AFC) task in which the participant responds to a sound stimulus. The psychometric function that defines the probability of correct detection given stimulus intensity x has the general form [68]:
$\psi(x \mid \alpha, \beta, \gamma, \delta) = \gamma + (1 - \gamma - \delta)\, F(x; \alpha, \beta)$  (9.1)
The participant's response is recorded as either yes = 1 (heard) or no = 0 (not heard). The two-parameter sigmoid function F(x; α, β) characterises the relationship between the response probability and the stimulus intensity, following the logistic or cumulative normal (cumulative Gaussian) form. For example, using the logistic form, the PF is given by:
$\psi(x \mid \alpha, \beta, \gamma, \delta) = \gamma + (1 - \gamma - \delta)\,\dfrac{1}{1 + e^{-\beta(x - \alpha)}}$  (9.2)
where the parameter vector θ = (α, β, γ, δ) of the psychometric function consists of α (threshold), β (slope), γ (guess rate, the tendency to incorrectly identify a stimulus below the threshold) and δ (lapse rate, the tendency not to notice a stimulus above the threshold). The design variable is stimulus intensity, d = x.
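As a concrete illustration of equation (9.2), the following minimal Python sketch evaluates the logistic psychometric function at a few stimulus intensities. The threshold, slope, guess and lapse values used here are illustrative assumptions, not values drawn from the chapter.

```python
import numpy as np

def psychometric_function(x, alpha, beta, gamma=0.5, delta=0.02):
    """Logistic psychometric function of equation (9.2).

    x     : stimulus intensity (e.g. dB)
    alpha : threshold (the 50 percent point of the underlying sigmoid)
    beta  : slope
    gamma : guess rate (0.5 for a 2AFC task, close to 0 for a yes/no task)
    delta : lapse rate
    """
    F = 1.0 / (1.0 + np.exp(-beta * (x - alpha)))
    return gamma + (1.0 - gamma - delta) * F

# Detection probabilities for an assumed listener with a 30 dB threshold
levels = np.array([10, 20, 30, 40, 50])
print(np.round(psychometric_function(levels, alpha=30, beta=0.5), 3))
```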
Step 4: Estimating the errors
The estimates of the parameters are often based on a limited number of trials, and as such are only estimates of the "true" values of the parameters. Running many clinical trials is naturally cumbersome, which lends itself to the efficiency that computerised (automated) audiometry methods bring [29]. Computerised audiometry is not, however, well suited for clinical use because patient variability and tester flexibility are the rule and not the exception [17]. Nonetheless, there have been important advances in computerised audiometry, which are the focus of this chapter.

Step 5: Determining the goodness of fit (a priori) or certainty (a posteriori)
In a priori methods the goodness of fit, which is determined by comparing two models statistically with a p-value less than 0.05 (or sometimes 0.01), measures how well the fitted PF accounts for the data. In a posteriori methods, it is the notion of certainty that is incrementally achieved according to the data [5].

In recent years, computerised methods drawing mainly from machine learning, data science and statistics have received increasing attention in audiometry, particularly in the assessment of the psychometric function and audiogram estimation [40]. The next section discusses the computerised methods, particularly the specialised Gaussian Process Classification (GPC), and later reviews how GPC has been used with audiogram estimation.

1.2 Machine Learning
Machine learning (ML), in its classical definition, is "the ability of a computer program to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E" [44]. The principal idea behind ML is the ability to predict outcomes (learn) from what is known (existing data) using algorithms [9, 43]. ML primarily draws from computer science, statistics, mathematics, engineering, physics and psychology [1, 4]. Other names have been used to represent ML, including artificial intelligence, pattern recognition, statistical modelling, data mining, knowledge discovery, predictive analytics, prescriptive analytics, data science, adaptive systems, self-organising systems, and many more [9]. The different names represent the ways in which the learning is organised: to improve the learning process (task oriented), to simulate the human learning process (cognitive simulation) or to find new learning methods and algorithms independent of domain [1, 43]. Psychophysics seeks to understand the learning process and is task oriented.

ML provides the ability to improve the understanding of what is known at every succeeding attempt of the same task drawn from the same population [55]. Since ML programs are trained on data to identify outcomes of interest, the same programs can be transferred to other contexts with minimal changes to the code [9, 32]. This transferability and the increased availability of data make ML an appealing alternative for solving some of the most pressing challenges in the different disciplinary aspects of society [55].

There are three primary approaches in ML: supervised learning, unsupervised learning and reinforcement learning. In supervised learning, the output is predefined; in unsupervised learning, the output is not defined upfront and patterns are identified from the data. Reinforcement learning starts with exploration of certain optimal outputs, followed by exploitation to find the best pattern to arrive at that optimal output [1].
Each of these ML approaches uses different combinations of statistical and/or computational methods, sometimes stacking the methods or running them in parallel [9]. Each of the different methods carries with it epistemological and ontological assumptions about what exists and how reality is modelled [9, 26, 56]. The choice of ML method therefore has an influence on what can be learned from data, and the extent to which generalisations can be made to new scenarios. ML therefore offers a great deal of promise in improving the sensitivity and specificity of audiogram estimation. We now turn attention to an increasingly popular Bayesian ML approach for audiogram estimation, Gaussian Processes (GPs), particularly GP classification (GPC) [30, 3, 58, 35].

1.3 Gaussian Processes
A key advantage of Bayesian models for experimental designs such as audiogram estimation, compared with the alternative frequentist models, is in their treatment and interpretation of uncertainty as part of the learning process. For frequentist models, the probability of an event is related to the frequency of repeated events; that is, the model parameters are fixed regardless of varying data. Frequentist models do not deal with uncertainty; rather, their "rejection" of the null hypothesis offers a statement about the unusual nature of the observed data and not about the plausibility of alternative hypotheses. For Bayesian models, the probability of an event is related to the certainty (and uncertainty) of the event; that is, the model parameters vary while the data is fixed. To a Bayesian, every problem is different and should be analysed contextually given the known information (the prior).

Gaussian Processes (GPs) are a Bayesian inference method, meaning that their approach to making predictions in machine learning is based on Bayes' rule (theorem):
$\text{posterior} = \dfrac{\text{likelihood} \times \text{prior}}{\text{marginal likelihood}}$  (9.3)
where the posterior represents the updated belief (probability distribution) based on a prior belief (probability distribution) and a known likelihood with respect to the data, normalised by the marginal likelihood. Bayesian models take the likelihood and convert it into a valid probability distribution, in this instance a Gaussian distribution. To be Gaussian distributed means to follow the Gaussian equation:
$f(x) = \dfrac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$  (9.4)
where σ is the standard deviation, µ the expected value, and σ² the variance. A Gaussian is a continuous type of probability distribution that describes random (stochastic) variables, which are scalars or, for multivariate distributions, vectors. A stochastic process generalises the probability distribution [47, 65] and therefore governs the properties of functions. A GP is, therefore, a probability distribution over functions. GPs are plotted against the index of the variables based on the confidence intervals and the expected value, µ.
GPs are therefore suited to supervised learning problems. Supervised learning (see subsection 1.2) is inherently inductive as it begins with the data (a training set), and then seeks to identify the function that makes possible predictions for other new inputs (a test set) [47]. ML inductive methods differ in the way they arrive at their generalisations of unobserved data: either by restricting the functions that map the inputs to the outputs (restriction bias) or by accepting all functions but assigning a higher probability to those functions that are more preferable (preference bias) [44]. The danger of restriction bias is missing the ideal function completely, while the challenge of the preference bias method is the intractable problem of how to choose from an infinite number of functions. GPs as stochastic processes resolve the challenge in preference bias by using a prior over suitable functions [7] and by using approximation (sampling or linear embedding) methods to assign a higher probability to functions that are more likely to suit the observed data [47].

Experimental designs, such as psychophysical audiogram assessments, are constrained by two competing goals: efficiency and precision. The amount of data (number of trials) affects precision; too few trials limit the inferences that can be made, while too many trials introduce more noise (especially from human factors), invariably reducing precision [68]. The solution is to balance precision and efficiency by optimising the design space (the number of trials) through efficient sampling and approximation methods. The two qualities of dealing with uncertainty (as can be expected from human participants) and of approximation efficiency make GPs an ideal ML candidate for audiogram estimation.
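To make the idea of a distribution over functions concrete, the short sketch below draws a few sample functions from a zero-mean GP prior with a squared-exponential covariance. The index points, unit variance and lengthscale are arbitrary choices made purely for illustration.

```python
import numpy as np

# Index points at which the sampled functions are evaluated
X = np.linspace(0, 10, 100)

# Squared-exponential covariance with unit variance and unit lengthscale
K = np.exp(-0.5 * (X[:, None] - X[None, :]) ** 2)
K += 1e-8 * np.eye(len(X))  # small jitter for numerical stability

# Each draw from N(0, K) is one sample function from the GP prior
rng = np.random.default_rng(0)
samples = rng.multivariate_normal(np.zeros(len(X)), K, size=3)
print(samples.shape)  # (3, 100): three functions drawn from the prior
```

Conditioning these sampled functions on observed data, as described in subsection 1.5 and illustrated in Figure 9.1, concentrates them around the observations.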
1.4 Gaussian Process Methods
Assume a supervised learning task with training data $\mathcal{D} = \{x_i, y_i\}_{i=1}^{n}$ with input vector $X = [x_1, \ldots, x_n]^T$ associated with labels $y = [y_1, \ldots, y_n]^T$, where $x_i \in \mathbb{R}^D$ and $y_i \in \mathbb{R}$.
The purpose is to use the observed data in $\mathcal{D}$ to train a model and then use the trained model to make predictions on unobserved data $X_*$.

1.5 Bayesian Inference
Supervised learning using GPs generalises Gaussian distributions over functions as latent random variables $f = [f_1, \ldots, f_n]^T$, where any subset of f is Gaussian distributed [7]. The result is a probability distribution p(f). The GP then models the functions that are consistent with the observed data, and combined with the prior belief, gives the posterior distribution $p(f \mid \mathcal{D})$, thereby reducing uncertainty around the observed data. An illustration at one observed point is shown in Figure 9.1.
Figure 9.1  GP prior (left) and GP posterior (right) (Source: Author)
Since the Gaussian distribution is fully specified by the mean and covariance [39], the GP is similarly specified by its mean µ (popularly set to $\mu(x) = 0$ for all $x \in X$) and a positive semidefinite covariance matrix K, also known as the kernel function [47]. The kernel function defines the covariance structure of the functions as latent random variables:

$\mathrm{cov}(f(x_i), f(x_j)) = K(x_i, x_j \mid \theta)$  (9.5)

where θ are the covariance parameters that define the kernel function and determine the shape and characteristics of the functions that can be drawn from the GP. The learning problem in GPs is therefore about selecting a most appropriate kernel function and then identifying the suitable properties of the chosen kernel function [47]. The most popular choice of kernel function is the homogeneous and isotropic radial basis function (RBF) [4, 38], given by:
$K(x_i, x_j \mid \theta) = \sigma^2 \exp\!\left(-\dfrac{\lVert x_i - x_j \rVert^2}{2\ell^2}\right)$  (9.6)

where σ² is the scaling factor that determines the variation of function values from their mean [7, 13] and ℓ is the lengthscale associated with $x_i$ that describes the width of the kernel and therefore the smoothness of the functions of the model [11].

When the labels y of the supervised learning task are continuous, it is a GP regression learning task, a generalisation of linear regression (Gaussian). When the labels are discrete (as in the audiometric binary response set yes/no), it is a GP classification learning task, a generalisation of logistic regression (non-Gaussian). The generalisation methods in the two GP learning tasks are different because of the conjugate/non-conjugate nature of the respective likelihoods with the priors. Inference is therefore approached differently in the two learning tasks, with the GP classification task being an adaptation of the GP regression task [4], discussed next.
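In practice the kernel is usually specified through a GP library rather than hand-coded. The sketch below shows one way of doing this with scikit-learn, where ConstantKernel supplies the scaling factor σ² and RBF the lengthscale ℓ of equation (9.6); the numeric values are illustrative assumptions only.

```python
import numpy as np
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# sigma^2 = 4.0 (scaling factor) and lengthscale = 10.0, as in equation (9.6)
kernel = ConstantKernel(constant_value=4.0) * RBF(length_scale=10.0)

# Covariance matrix between a few 1-D inputs (e.g. stimulus intensities in dB)
X = np.array([[0.0], [5.0], [10.0], [40.0]])
print(np.round(kernel(X), 3))
# Nearby inputs have covariance close to sigma^2; widely separated inputs
# (0 dB versus 40 dB) have covariance near zero, i.e. nearly independent
# function values, and a shorter lengthscale makes this decay faster.
```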
1.6 Gaussian Process Regression
Linear regression models the relationship between a scalar output y and one or more noisy input observations x ∈ X through a latent function f:
$f(x) = x^T w, \quad y = f(x) + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma_n^2)$  (9.7)
where w is the vector of weights (parameters) of the linear model and ε is Gaussian noise associated with the observed data in $\mathcal{D}$. An assumption is often made that the noise is independent and identically Gaussian distributed with mean 0 and variance $\sigma_n^2$ [38]. The Bayesian approach to linear regression is to take f(x) as the prior and update it as new data x ∈ X is observed to create a posterior distribution using Bayes' rule (equation 9.3). The GP approach is to transform the inputs x into a higher dimensional space using a kernel function (commonly the RBF in equation 9.6), in what is known as the kernel trick [11], to give:
$p(f(x)) = \mathcal{GP}(\mu(x), K(X, X))$  (9.8)

Model selection
To show how the latent function f relates to y, a likelihood function is introduced to give:
$p(y \mid f) = \mathcal{N}(f, \sigma^2 I)$  (9.9)
The Bayesian approach to model selection is through computation of the above model given the data, based on the marginal likelihood [47]:
$p(f \mid y, X) = \dfrac{p(y \mid f)\, p(f \mid X)}{p(y \mid X)}$  (9.10)
where $p(f \mid X)$ represents the prior beliefs about the characteristics of the functions that can be drawn from the GP, and $p(y \mid X)$ represents the marginal likelihood, obtained by integrating over all possible values that f can take. The marginal likelihood is evaluated as:
$p(y \mid X) = \int p(y \mid f)\, p(f \mid X)\, df$  (9.11)
Predictions
The predictive distribution over unseen test points $x_*$ given noisy observations of f is then given by $f(x_*)$, which we shall denote by $f_*$. The joint distribution between the training function values y and test function values $f_*$ is therefore:
$p\!\begin{pmatrix} y \\ f_* \end{pmatrix} = \mathcal{N}\!\left( \begin{pmatrix} \mu(X) \\ \mu(X_*) \end{pmatrix},\ \begin{pmatrix} K(X, X) + \sigma^2 I & K(X, X_*) \\ K(X_*, X) & K(X_*, X_*) \end{pmatrix} \right)$  (9.12)
Conditioning the multivariate Gaussian on the observed training values using Bayes' rule yields:
$p(f_* \mid x_*, \mathcal{D}) = \mathcal{N}\big(f_*;\ \mu_{f \mid \mathcal{D}}(X_*),\ K_{f \mid \mathcal{D}}(X_*, X_*)\big)$  (9.13)

where $\mu_{f \mid \mathcal{D}}(X_*)$ is a valid mean:
$\mu_{f \mid \mathcal{D}}(X_*) = K(X_*, X)\,[K(X, X) + \sigma^2 I]^{-1}\, y$  (9.14)

and $K_{f \mid \mathcal{D}}(x_*, x_*)$ is also a valid kernel function:
$K_{f \mid \mathcal{D}}(X_*, X_*) = K(X_*, X_*) - K(X_*, X)\,[K(X, X) + \sigma^2 I]^{-1}\, K(X, X_*)$  (9.15)
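The following minimal NumPy sketch implements equations (9.14) and (9.15) directly, using the RBF kernel of equation (9.6). The training points, noise variance and kernel settings are toy assumptions chosen only to show the mechanics of GP regression prediction.

```python
import numpy as np

def rbf_kernel(A, B, sigma_f=1.0, lengthscale=10.0):
    # Equation (9.6) for 1-D inputs
    sqdist = (A[:, None] - B[None, :]) ** 2
    return sigma_f**2 * np.exp(-0.5 * sqdist / lengthscale**2)

def gp_regression_predict(X_train, y_train, X_test, noise_var=0.1):
    """Posterior mean and covariance of equations (9.14) and (9.15)."""
    K = rbf_kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_s = rbf_kernel(X_test, X_train)        # K(X*, X)
    K_ss = rbf_kernel(X_test, X_test)        # K(X*, X*)
    K_inv = np.linalg.inv(K)                  # [K(X, X) + sigma^2 I]^{-1}
    mean = K_s @ K_inv @ y_train              # equation (9.14)
    cov = K_ss - K_s @ K_inv @ K_s.T          # equation (9.15)
    return mean, cov

# Toy noisy observations of a latent function
X_train = np.array([5.0, 20.0, 40.0, 60.0])
y_train = np.sin(X_train / 10.0)
X_test = np.linspace(0, 70, 8)

mean, cov = gp_regression_predict(X_train, y_train, X_test)
print(np.round(mean, 2))
print(np.round(np.sqrt(np.diag(cov)), 2))  # predictive standard deviations
```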
For this study, the output in audiogram estimation consists of discrete yes/no responses and therefore calls for GP classification.

1.7 Gaussian Process Classification
The task in GP classification is to predict the probability that the output falls into certain classes of predefined categories (labels) such that y ∈ {C1, C2, ..., Cc}, which in the case of audiogram estimation is the binary yes/no response set, meaning that y ∈ {0, 1}.

Model selection
Unlike in GP regression, where the likelihood and prior are conjugate Gaussian and the posterior is therefore also Gaussian, the prior in GP classification does not restrict the output to lie within the [0, 1] interval and is therefore not conjugate with the likelihood. The strategy in discriminative predictive methods (different from generative methods) [25] is to use a sigmoid function Φ (usually the logistic function or cumulative Gaussian [47]) to monotonically transform the prior pointwise and restrict the output to lie in [0, 1]:
$p(y = 1 \mid f) = \Phi(f) = \dfrac{1}{1 + e^{-f}}$  (9.16)
For the binary classification problem it is sufficient to determine $p(y = 1 \mid f)$, because $p(y = 0 \mid f)$ can be obtained as $1 - p(y = 1 \mid f)$.
Predictions
There are two steps involved in achieving the predictive posterior distribution: first, to compute the distribution of the latent variable $f_*$ for the test case:
$p(f_* \mid X, y, x_*) = \int p(f_* \mid X, x_*, f)\, p(f \mid X, y)\, df$  (9.17)

where $p(f \mid X, y)$ is the posterior over the latent variables in equation (9.10). This therefore gives:
$p(f_* \mid X, y, x_*) = \dfrac{1}{p(y \mid X)} \int p(y \mid f)\, p(f, f_* \mid X, x_*)\, df$  (9.18)
Including the sigmoid function gives:
$p(f_* \mid X, y, x_*) = \dfrac{1}{p(y \mid X)} \int \prod_{i=1}^{n} \Phi(f_i)\, p(f, f_* \mid X, x_*)\, df$  (9.19)
The second step is to use the distribution over the latent $f_*$ with the sigmoid likelihood to produce a probabilistic predictive distribution, which in the binary instance for test point $y_* = 1$ gives:
$p(y_* = 1 \mid X, y, x_*) = \int p(y_* = 1 \mid f_*)\, p(f_* \mid X, y, x_*)\, df_* = \int \Phi(f_*)\, p(f_* \mid X, y, x_*)\, df_*$  (9.20)
The use of non-linear response functions means that exact computation of the normalisation constant, $\int p(y = 1 \mid f)\, p(f)\, df$, is no longer analytically possible. Therefore, approximation schemes, drawing from numerical analysis, are used [4]. The most common approximation techniques used with GPs are Markov Chain Monte Carlo (MCMC), Expectation Propagation (EP) and Laplace methods.

Approximation schemes
MCMC methods are a class of computational stochastic methods [24] that draw samples from the joint posterior distribution [4]. MCMC methods allow a distribution to be characterised without knowing all of the distribution's mathematical properties, by randomly sampling values out of the distribution [48]. Expectation propagation, as a deterministic variational inference method [24], sequentially approximates the posterior distribution by approximating its first and second moments [33, 4]. Laplace approximation methods for binary outputs probabilistically approximate the posterior distribution by fitting a Gaussian distribution to a second order Taylor expansion of the logarithm of $p(f \mid X, y)$ around the maximum of the posterior [4, 47]. Even though Laplace methods have rarely been used in GP classification learning tasks because of their poor approximation of the posterior [47], they are useful when dealing with large datasets [7]. For example, the National Institute for Occupational Safety and Health (NIOSH) noise database, with audiograms from 1981 to 2010, contains more than 1.8 million audiograms [42].

Hyperparameters
The selection of the parameters of the kernel function, for example the scaling factor σ² and lengthscale ℓ in equation (9.6), has a strong influence on the behaviour of the latent function. Shorter lengthscale values tend to produce overfitted models, compared to longer ones which produce underfitted (though smoother) models [13, 41]. It is therefore necessary to learn the most optimal values of these parameters, which are called hyperparameters because the kernel function is non-parametric [47]. This is usually done by maximum likelihood (a priori) using gradient descent. The a posteriori method is to integrate over all possible hyperparameters of the RBF:
$p(y \mid X, \theta) = \int p(y \mid f, X, \theta)\, p(f \mid X, \theta)\, df$  (9.21)
where θ is the hyperparameter vector that contains all the parameters concatenated into one vector [41]. However, equation (9.21) is analytically intractable because $p(y \mid f, X, \theta)$ is non-linear sigmoidal (equation 9.16) [66, 45]. The common approach is to maximise the log marginal likelihood, $\log p(y \mid X, \theta)$, using its first partial derivatives with respect to the hyperparameters, $\frac{\partial \log p(y \mid X, \theta)}{\partial \theta_i}$ [66, 45].
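As a practical illustration of the classification pipeline described above, the sketch below fits a GP classifier to simulated yes/no responses using scikit-learn, whose GaussianProcessClassifier approximates the non-Gaussian posterior with the Laplace method and tunes the kernel hyperparameters by maximising the approximate log marginal likelihood. The simulated listener, its threshold and the kernel settings are assumptions made only for illustration.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# Simulated yes/no audiometric responses at a single frequency:
# intensities in dB and binary responses from a listener with a ~35 dB threshold
rng = np.random.default_rng(1)
X = rng.uniform(0, 80, size=(40, 1))                  # presented intensities
p_true = 1.0 / (1.0 + np.exp(-0.3 * (X[:, 0] - 35)))
y = (rng.uniform(size=40) < p_true).astype(int)       # observed yes/no responses

# RBF kernel with initial (to-be-optimised) hyperparameters
kernel = ConstantKernel(1.0) * RBF(length_scale=20.0)
gpc = GaussianProcessClassifier(kernel=kernel).fit(X, y)

# Predicted detection probabilities over a grid of test intensities
X_test = np.linspace(0, 80, 9).reshape(-1, 1)
print(np.round(gpc.predict_proba(X_test)[:, 1], 2))
print("optimised kernel:", gpc.kernel_)
```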
We next present a systematic review of how GPC has been applied in audiogram estimation.
2. METHOD

A systematic review was used to identify how GP classification (GPC) studies have been undertaken for audiogram estimation.

2.1 Protocol, Eligibility Criteria and Search
The review followed the PRISMA framework (Preferred Reporting Items for Systematic Reviews and Meta-Analyses), as it provides measures for reporting quality, maintaining transparency and reducing bias [28], and improves the documentation of the review protocol [53]. The eligibility criteria were set as published studies and reports without any restrictions on the publication time horizon. There were no restrictions on the type of outlet (journal, conference, book chapter or thesis). Only peer-reviewed publications reported in the English language were eligible for inclusion.

GPC comes primarily from the computer science, mathematics and statistical fields and is applied in different disciplinary contexts. The systematic review therefore represents a multidisciplinary search. The following five multidisciplinary electronic academic databases were accessed to retrieve peer-reviewed publications: IEEE Xplore digital library, Scopus, Science Direct, Google Scholar and the ACM Digital Library. Other terms related to audiogram estimation are psychometric function, audiometry and audiometric assessment. GPC does not have a synonymous name and was searched as such. The review used the Boolean operators "OR" and "AND" to construct the following search string: "Gaussian Process" AND classif* AND (audiogram OR audiometr* OR psychometric). The search was carried out in February 2021.

2.2 Selection and Extraction
Table 9.1 describes the inclusion, exclusion and quality criteria, and Table 9.2 describes the process used to select the final set of papers. The initial search represents the number of articles returned when the search was performed on the metadata of the publication outlets (title, abstract and keywords). The first order selection represents the first screening based on the keywords, the title and the abstract. In the second order selection, duplicates were removed. The third order selection eliminated papers using the eligibility criteria. Cherry-picked items are those recommended by experts in the field or publications identified from the references of other identified papers. Sixteen papers were finally selected for review and analysis. The university library portal was used to access the 16 papers where necessary. It should be noted that some of the technical details, especially the computational and mathematical methods, are sometimes published independently as full papers in either mathematics or computer science (machine learning) outlets without the audiology context. Such papers were excluded from the review.
Table 9.1  Inclusion, exclusion and quality criteria of the systematic review

No. | Inclusion criteria | Exclusion criteria | Quality criteria
1 | Academic publications | Predatory outlets | Reputable sources such as accredited publications and/or international organisations (to avoid predatory outlets)
2 | Double-blind peer-reviewed publications | Non-peer-reviewed publications | For theses which have been published in conference proceedings or as journal articles, the publication was retained
3 | Human hearing | Animal hearing |
4 | Gaussian Process Classification (GPC) | Does not focus on GPC or audiogram estimation |
5 | Audiogram estimation | Not written in English |
6 | Recommended grey area literature (non-peer-reviewed) | |
Table 9.2  Paper and report selection process

Electronic database | Initial search | 1st order selection | 2nd order selection | 3rd order selection
Scopus / Web of Science | 3 | 3 | |
IEEE Xplore | 0 | 0 | |
Science Direct | 0 | 0 | |
Google Scholar | 35 | 21 | |
ACM Digital Library | 8 | 1 | |
All databases combined (duplicates removed) | | | 14 | 12
Cherry-picked | 0 | 0 | 0 | 4
Total | 46 | 25 | 14 | 16
Table 9.3  Summary of analysed papers

# | By | Year | Country (type): summary
1 | [20] | 2008 | Netherlands (Journal): A probabilistic kernel is described for pairwise learning based on GPCs for predicting speech quality degradation in hearing aids. The GP method is designed to distinguish between hearing impaired and normal hearing patients. Laplace approximation is used.
2 | [21] | 2011 | Netherlands (Journal): The paper considers sound quality as perceived in users of hearing aids and develops a GP method to predict pairwise preference learning. The method individualises the learning process in hearing aids, taking into account individual hearing-impaired listener preferences. The probabilistic kernel approach improves the preference judgements. The psychophysical tests are framed as pairwise preference learning.
3 | [46] | 2015 | Denmark (Journal): The paper develops a personalised 2D and 4D audiogram estimation method to tune hearing aids over a wider PF scale using active learning and GPs. The method using direct feedback from users was preferable. The GP model uses Laplace approximation.
4 | [60] | 2015 | USA (Journal): The paper develops the non-parametric approach to audiogram estimation using GP regression. Compared with the Hughson–Westlake (HW) traditional method, the new GP approach is faster, uses significantly fewer samples and is methodologically accurate and reliable. The GP method also does not need human supervision but can be carried out by the individual, and is much more difficult for the individual to manipulate.
5 | [14] | 2015 | USA (Conference): The study uses latent information from binary audiometric tests. The psychophysical tests are framed as an active learning problem. The next stimulus to be presented is conditioned on the previous one. The paper reduced iterations by 85 percent. Prior knowledge about the class of the stimuli is introduced, which assists with selecting sample points; multiple points can also be tested simultaneously.
6 | [15] | 2015 | USA (Conference): The paper applied the active learning approach on noise-induced hearing-impaired listeners (NIHL) and called it Bayesian Active Model Selection (BAMS). Far fewer responses are seen for NIHL patients, much faster than traditional methods, and it can be done in real-time. NIHL was framed as an active learning problem.
7 | [6] | 2017 | Netherlands (Conference): A database of audiograms from the Nordic countries is used as a prior in the GP, and compared with previous work by Gardner, 2015 [15, 16] that used uninformative priors. A Gaussian Mixture Model is adopted with the RBF as the kernel. The mixture function is not fixed but rather uses a nearest neighbour regression model. The choice of a mixture model is because all audiograms fit into certain sub-groups. The process improves the predictive accuracy of the threshold in audiogram estimation. Uses GP, Gaussian mixture model and gradient descent. The GP prior using the mixture model outperforms those empirically optimised, and even more when conditioned with age, and furthermore with age and gender.
8 | [8] | 2017 | USA (Thesis, PhD): The thesis develops the simultaneous estimation of audiograms from both ears. Speeds up testing of both ears since information is shared between both ears. The thesis uses Gardner et al. [14, 15] as a baseline.
9 | [58] | 2017 | USA (Journal): A non-parametric Bayesian multidimensional psychometric function estimator is developed for the binary responses used in audiogram estimation. The simulated tests on 1-dimensional and 2-dimensional simulations both offer comparable accuracy to the parametric methods. The GPC offers similar accuracy with the 1D measures.
10 | [51] | 2018 | USA (Journal): Two Bayesian active learning methods are extended by incorporating a lapse rate into the likelihood function. Both are faster compared with traditional audiogram estimation. The "counting ML" was more efficient than the Yes/No ML used in Gardner et al. [15].
11 | [59] | 2018 | USA (Journal): The paper introduces a non-parametric approach using active learning to psychometric function estimation; this approach achieves the same accuracy but with fewer trials. Extends Song et al.'s [60] work. Active sampling methods, particularly from Song et al. [60], were found to be superior.
12 | [2] | 2019 | USA (Journal): The paper takes from Song et al. [60] using GPs for pure-tone audiometry for unilateral ears and creates a mobile app that is used for audiometric assessment. The GPs are treated as a black-box and not explained but assumed. The results yield faster and fewer trials.
13 | [3] | 2019 | USA (Journal): The paper proposes conjoining audiogram estimation into a single search space. Implemented the conjoint audiogram estimation to achieve bilateral audiogram estimation. The results show an improvement compared with sequential testing in both ears.
14 | [41] | 2019 | USA (Thesis, PhD): The multi-pronged thesis focuses on making active learning more efficient as a method for selecting the next set of data for the next label. The method balances between exploration and exploitation in the search process. Ultimately, the thesis develops a Bayesian optimisation model selection (BOMS) approach which improves active learning compared to previous models. The thesis makes a number of mathematical proofs and efficiency improvements concerning the challenge of the active search mechanism.
15 | [34] | 2019 | USA (Thesis, Masters): The thesis uses the NIOSH database as a prior to speed up learning and extends Barbour et al.'s [3] conjoint method to utilise the covariance across the conjoint audiometric tests to detect changes in an individual's audiograms between tests.
16 | [22] | 2020 | USA (Thesis, PhD): The thesis extends the active machine learning audiogram (AMLAG) method for audiogram estimation in one ear to both ears bilaterally. The thesis also creates a dynamic masking protocol that is more efficient for time and threshold estimation among hearing-impaired listeners but is also effective for normal hearing listeners. The results are faster compared to sequential estimation from unilateral estimations. The bilateral framework is extended to cognitive and perceptual joint assessments, except that the results in that area did not suggest any significant improvements. AMLAG was generalised from two unilateral tests to a single bilateral test, allowing observations in one ear to update the model fit in the other ear. The approach speeds up final estimates compared to sequential estimations.
2.3 Discussion of GPC in Audiogram Estimation

Transfer learning to wide Big Data missing
GPC has been restricted to single population groups and not tested for efficiency across multiple population groups in what is known as transfer learning [57]. Transfer learning involves the ability of ML models to adapt themselves to new situations, tasks and environments [67]. Specifically, Yang et al. [67] define transfer learning as "given source domain Ds and learning task Ts, a target domain Dt and corresponding learning task Tt, transfer learning aims to improve the learning of the target predictive function ft for the target domain using the knowledge in Ds and Ts, where Ds ≠ Dt or Ts ≠ Tt". One of the key benefits of transfer learning is transferring experience gained from one context to another. This additionally means that PDTs could be performed at significantly less cost in regions without equipment.

Only air-conduction
All the reviewed papers assumed air conduction or ran their simulations based on air conduction results; none considered bone conduction. There is therefore an opportunity to extend GPC to bone-conduction audiometry, even though the PDT task remains much the same. In bone conduction, sound bypasses the air conduction mechanism of the outer and middle ear and goes straight to the inner ear.
USA most active
All the studies came from three countries: 12 are from the USA, 3 from the Netherlands, and 1 from Denmark. The Danish paper by Nielsen et al. [46] and 2 of the Dutch papers, one by Schlittenlacher et al. [51] and the other by Cox and De Vries [6], are all extensions of Gardner et al. [15], though implemented differently. The greatest influence on GPC for audiogram estimation has therefore been from Denis L. Barbour's lab at Washington University in St. Louis; he is a co-author or supervisor on each of the 12 US papers and theses.

Early beginnings and the entry of active learning
The earliest papers that used GPC were both by Groot et al. [20, 21], and focused particularly on sound quality. In 2015, Nielsen et al. [46] and Song et al. [60] both used GP regression, except that Nielsen et al. [46] introduced active learning as the approximation method to overcome the intractability challenge of Bayesian models. The conference papers by Gardner et al. [14, 15] improved on the paper of Song et al. [60], which had used GP regression, and introduced GPC using active learning as well. Gardner et al. [14, 15] thereafter became the base upon which all further audiogram estimation using GPC is done, including the paper from the Netherlands by Cox and de Vries [6], who introduced an informative prior using a Gaussian Mixture Model. The succeeding papers wrap the Bayesian active learning model of Gardner et al. [14, 15] in a mobile app [3] or extend the base model to assess both ears simultaneously [22, 41, 2, 34]. There were no similar publications in 2009–2010, 2012–2014 or 2016.

Bayesian active learning
Active learning, also known as active sampling, is an emergent sampling scheme that continually seeks to use the most informative data to train a model, rather than having the model trained on a random sample from the entire dataset, as is done in passive sampling [62, 52]. Active learning draws from the principle of sampling in statistics, which is also closely related to approximation in mathematics (numerical analysis) and interpolation, and similarly to compression from signal processing in engineering [24]. The principle is to use as few samples as possible, drawn from an independent distribution, to avoid having to deal with the intractable problem of working with an entire dataset, especially high dimensional datasets [19] or in instances when obtaining data is expensive or scarce [23, 62]. Settles [52] identified three active learning scenarios: pool-based active learning, membership query synthesis and stream-based selective sampling. Houlsby et al. [23] adapted the membership query method of active learning to GPC using the information theoretic approach, particularly Shannon's theory of communication [54], and called it Bayesian Active Learning by Disagreement (BALD):
$\max_{x}\; H[\theta \mid \mathcal{D}] - \mathbb{E}_{y \sim p(y \mid x, \mathcal{D})}\big[H[\theta \mid y, x, \mathcal{D}]\big]$  (9.22)
Gardner et al. [14] adapted Houlsby et al.’s [23] Bayesian active learning by disagreement (BALD) and Garnett et al.’s [16] active learning method for GPs and applied them to audiogram estimation.
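To give a flavour of how the BALD criterion of equation (9.22) is used to select the next stimulus, the sketch below scores a handful of candidate intensities by Monte Carlo, assuming a probit (cumulative Gaussian) response function. The latent posterior means and variances are hypothetical stand-ins for the output of a fitted GP classifier; they are not taken from any of the reviewed studies.

```python
import numpy as np
from scipy.stats import norm

def binary_entropy(p):
    p = np.clip(p, 1e-9, 1 - 1e-9)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

def bald_scores(mu, var, n_samples=2000):
    """Monte Carlo BALD scores for a binary (yes/no) GP classification task.

    mu, var : posterior mean and variance of the latent function at each
              candidate stimulus level.
    Returns the mutual information between the response and the latent
    function; the next stimulus is the one with the highest score.
    """
    rng = np.random.default_rng(0)

    # Entropy of the marginal predictive distribution under a probit likelihood
    marginal_entropy = binary_entropy(norm.cdf(mu / np.sqrt(1.0 + var)))

    # Expected entropy, averaged over samples of the latent function
    f = mu[None, :] + np.sqrt(var)[None, :] * rng.standard_normal((n_samples, len(mu)))
    expected_entropy = binary_entropy(norm.cdf(f)).mean(axis=0)

    return marginal_entropy - expected_entropy

# Hypothetical latent posterior over five candidate intensities
mu = np.array([-2.0, -0.5, 0.0, 0.8, 2.5])
var = np.array([0.3, 0.9, 1.2, 0.8, 0.2])
print("next stimulus index:", int(np.argmax(bald_scores(mu, var))))
```

Intuitively, the criterion favours stimuli where the overall prediction is uncertain but individual draws of the latent function are confident, which is where a response is most informative about the threshold.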
3. CONCLUSION

We presented a brief tutorial on the emergent Gaussian Process Classification (GPC) and its application in psychophysical detection tasks (PDTs), with a particular focus on audiogram estimation. Other psychophysical tasks include touch, taste, smell and vision. PDTs are increasingly enjoying attention, especially in a world where artificial intelligence (AI) agents are used to engineer human cognition and behaviour. Psychophysics offers an opportunity to understand and deal with the impact of such unseen AI agents.

GPC has more recently been significantly influenced by active learning as a means to further reduce the number of PDT trials required for the GPC to learn from. GPC methods combined with active learning reduced the PDTs in audiometry from more than 400 trials to fewer than 16, and in some instances extended assessment to both ears at the same time. This is a significant reduction in both cost and time for clinical assessments, which are both time-consuming and expensive. The computational assessments can also be done using easily available digital devices such as a mobile or web-based app.

The chapter makes a contribution by extending the use of the powerful Bayesian GPC method beyond active learning to include transfer learning, and thereby generalising beyond the method to the task, in this instance PDTs. This elevates the role of ML beyond data to meaningful diagnostic contributions in resource-constrained regions. Further, none of the studies that used GPC extended beyond sensory PDTs to non-sensory factors that might also affect the sensory stimuli, for example environmental factors. In audiometry, there are many pathophysiologic factors that influence hearing ability beyond age, gender and noise history, and these make the classification of audiograms based on threshold data complex. They include injury, disease, medication, diet, altitude, smoking history and many others [10, 63, 50, 37]. The nature of PDTs is therefore more than the simple sensory-stimulus response. There is consequently an opportunity to extend the powerful GPC to include non-sensory factors that influence PDTs.
REFERENCES

1. Alpaydin, E. (2014). Introduction to Machine Learning. London: MIT Press.
2. Barbour, D.L., DiLorenzo, J.C., Sukesan, K.A., Song, X.D., Chen, J.Y., Degen, E.A., Heisey, K.L. et al. (2019). Conjoint psychometric field estimation for bilateral audiometry. Behavior Research Methods 51(3), 1271–85. Accessed at https://doi.org/10.3758/s13428-018-1062-3.
3. Barbour, D.L., Howard, R.T., Song, X.D., Metzger, N., Sukesan, K.A., DiLorenzo, J.C., Snyder, B.R. et al. (2019). Online machine learning audiometry. Ear and Hearing 40, 918–26. Accessed at https://doi.org/10.1097/AUD.0000000000000669.
4. Bishop, C.M. (2006). Pattern Recognition and Machine Learning. New York: Springer Science Business Media BV, 1st edn.
5. Carlin, B.P. and Louis, T.A. (2000). Bayes and Empirical Bayes Methods for Data Analysis, 2nd edn. Accessed at www.crcpress.co.
6. Cox, M. and de Vries, B. (2017). A Gaussian process mixture prior for hearing loss modeling. In: Duivesteijn, W., Pechenizkiy, M. and Fletcher, G.H.L. (eds), Benelearn 2017: Proceedings of the Twenty-Sixth Benelux Conference on Machine Learning, pp. 74–6. Technische Universiteit Eindhoven. Accessed at http://openscholarship.wustl.edu/cgi/viewcontent.cgi?article=1231&context=eng.
7. Cutajar, K. (2019). Broadening the scope of Gaussian processes for large-scale learning. Accessed at https://www.eurecom.fr/publication/5852.
8. Dilorenzo, J. (2017). Conjoint audiogram estimation via Gaussian process classification. MSc, Washington University. Accessed at https://openscholarship.wustl.edu/eng.
9. Domingos, P. (2015). The Master Algorithm. New York: Basic Books.
10. Dubno, J.R., Eckert, M.A., Lee, F.S., Matthews, L.J. and Schmiedt, R.A. (2013). Classifying human audiometric phenotypes of age-related hearing loss from animal models. Journal of the Association for Research in Otolaryngology 14, 687–701. Accessed at https://doi.org/10.1007/s10162-013-0396-x.
11. Duvenaud, D.K. (2014). Automatic model construction with Gaussian Processes. PhD thesis. Accessed at https://www.repository.cam.ac.uk/handle/1810/247281.
12. Fosso Wamba, S., Akter, S., Edwards, A., Chopin, G. and Gnanzou, D. (2015). How "big data" can make big impact: Findings from a systematic review and a longitudinal case study. International Journal of Production Economics 165, 234–46. Accessed at https://doi.org/10.1016/j.ijpe.2014.12.031.
13. Gabasova, E. (2014). Covariance functions. Accessed at http://evelinag.com/Ariadne/covarianceFunctions.html.
14. Gardner, J.R., Song, X., Weinberger, K.Q., Barbour, D. and Cunningham, J.P. (2015). Psychophysical detection testing with Bayesian active learning. In: Uncertainty in Artificial Intelligence - Proceedings of the 31st Conference, UAI, pp. 286–95. Accessed at http://www.stat.columbia.edu/~cunningham/pdf/GardnerUAI2015.pdf.
15. Gardner, J.R., Weinberger, K.Q., Malkomes, G., Barbour, D., Garnett, R. and Cunningham, J.P. (2015). Bayesian active model selection with an application to automated audiometry. In: Advances in Neural Information Processing Systems, pp. 2386–94. Accessed at https://dl.acm.org/doi/abs/10.5555/2969442.2969506.
16. Garnett, R., Osborne, M.A. and Hennig, P. (2015). Active learning of linear embeddings for Gaussian processes. In: Proceedings of the 30th Conference on Uncertainty in Artificial Intelligence, pp. 230–39. AUAI Press. Accessed at http://www.gaussianprocess.org/gpml/code.
17. Gelfand, S. (2016). Essentials of Audiology. Thieme Medical Publishers, 4th edn.
18. Gescheider, G. (2013). Psychophysics: The Fundamentals. Accessed at https://www.taylorfrancis.com/books/9781134801220.
19. Gomez-Rubio, V. (2020). Bayesian Inference with INLA. Boca Raton, FL: CRC Press. Accessed at https://doi.org/10.1201/9781315175584.
20. Groot, P., Heskes, T. and Dijkstra, T. (2008). Nonlinear perception of hearing-impaired people using preference learning with Gaussian Processes. Radboud University Nijmegen, pp. 1–19.
21. Groot, P.C., Heskes, T., Dijkstra, T.M. and Kates, J.M. (2011). Predicting preference judgments of individual normal and hearing-impaired listeners with Gaussian processes. IEEE Transactions on Audio, Speech and Language Processing 19(4), 811–21. Accessed at https://doi.org/10.1109/TASL.2010.2064311.
22. Heisey, K. (2020). Joint estimation of perceptual, cognitive, and neural processes. PhD thesis, May. Accessed at https://doi.org/10.7936/zssf-hg81, https://openscholarship.wustl.edu/artscietds.
23. Houlsby, N., Huszar, F., Ghahramani, Z. and Lengyel, M. (2011). Bayesian active learning for classification and preference learning, December. Accessed at https://arxiv.org/abs/1112.5745.
24. Isaac Newton Institute (INI) (2019). Approximation, sampling and compression in data science. Accessed at https://www.newton.ac.uk/event/asc.
25. Jebara, T. (2012). Machine Learning: Discriminative and Generative. New York: Springer Science and Business Media.
26. Kuhn, T.S. (1962). The Structure of Scientific Revolutions. University of Chicago Press, 3rd edn.
27. Kingdom, F.A.A. and Prins, N. (2016). Psychophysics: A Practical Introduction. Elsevier Academic Press.
28. Knobloch, K., Yoon, U. and Vogt, P.M. (2011). Preferred reporting items for systematic reviews and meta-analyses (PRISMA) statement and publication bias. Journal of Cranio-Maxillofacial Surgery 39(2), 91–2. Accessed at https://doi.org/10.1016/j.jcms.2010.11.001.
29. Kramer, S. and Brown, D. (2018). Audiology: Science to Practice. San Diego, CA: Plural Publishing.
30. Kujala, J.V. and Lukka, T.J. (2006). Bayesian adaptive estimation: The next dimension. Journal of Mathematical Psychology 50, 369–89. Accessed at https://doi.org/10.1016/j.jmp.2005.12.005.
31. Lambert, B. (2018). A Student's Guide to Bayesian Statistics. London: Sage.
32. Langley, P. and Simon, H.A. (1995). Applications of machine learning and rule induction. Communications of the ACM 38.
33. Larsen, P.G., Plat, N. and Toetenel, H. (1994). A formal semantics of data flow diagrams. Formal Aspects of Computing 6(6), 586–606. Accessed at https://doi.org/10.1007/BF03259387.
34. Larsen, T. (2019). Differential estimation of audiograms using Gaussian process active model selection. Masters, Washington University in St. Louis. Accessed at https://openscholarship.wustl.edu/engetds/4.
35. Larsen, T.J., Malkomes, G. and Barbour, D.L. (2020). Accelerating psychometric screening tests with Bayesian active differential selection. ArXiv. Accessed at https://arxiv.org/abs/2002.01547.
36. Lawless, H.T. (2013). Quantitative Sensory Analysis: Psychophysics, Models and Intelligent Design. Wiley. Accessed at https://doi.org/10.1002/9781118684818.
37. Li, W., Zhao, Z., Chen, Z., Yi, G., Lu, Z. and Wang, D. (2021). Prevalence of hearing loss and influencing factors among workers in Wuhan, China. Environmental Science and Pollution Research 28(24), 31511–19. Accessed at https://doi.org/10.1007/s11356-021-13053-y.
38. MacKay, D. (1998). Introduction to Gaussian processes. In Barber and Williams (eds), Neural Networks and Machine Learning, Springer-Verlag, pp. 84–92. Accessed at http://www.cs.toronto.edu/~radford/.
39. MacKay, D.J.C. (1992). Bayesian interpolation. Neural Computation 4, 415–47. Accessed at https://doi.org/10.1162/neco.1992.4.3.415.
40. Mahomed, F., Swanepoel, D.W., Eikelboom, R.H. and Soer, M. (2013). Validity of automated threshold audiometry: A systematic review and meta-analysis. Ear and Hearing 34, 745–52. Accessed at https://doi.org/10.1097/01.aud.0000436255.53747.a4.
41. Malkomes, G. (2019). Automating Active Learning for Gaussian Processes. PhD thesis.
42. Masterson, E.A., Deddens, J.A., Themann, C.L., Bertke, S. and Calvert, G.M. (2015). Trends in worker hearing loss by industry sector, 1981–2010. American Journal of Industrial Medicine 58(4), 392–401. Accessed at https://doi.org/10.1002/ajim.22429.
43. Michalski, R.S., Carbonell, J. and Mitchell, T. (1983). An overview of machine learning. Accessed at https://www.sciencedirect.com/science/article/pii/B9780080510545500054.
44. Mitchell, T.M. (1997). Machine Learning. Burr Ridge, IL: McGraw Hill.
45. Nickisch, H. and Rasmussen, C.E. (2008). Approximations for binary Gaussian process classification. Journal of Machine Learning Research 9, 2035–78. Accessed at https://www.jmlr.org/papers/volume9/nickisch08a/nickisch08a.pdf.
46. Nielsen, J.B.B., Nielsen, J. and Larsen, J. (2015). Perception-based personalisation of hearing aids using Gaussian processes and active learning. IEEE/ACM Transactions on Audio Speech and Language Processing 23(1), 162–73. Accessed at https://doi.org/10.1109/TASLP.2014.2377581.
47. Rasmussen, C.E. and Williams, C.K.I. (2006). Gaussian Processes for Machine Learning. MIT Press.
48. Van Ravenzwaaij, D., Cassey, P. and Brown, S.D. (2018). A simple introduction to Markov Chain Monte Carlo sampling. Psychonomic Bulletin and Review 25(1), 143–54. Accessed at https://doi.org/10.3758/s13423-016-1015-8.
49. Robert, C.P. (2007). The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation. Springer. Accessed at https://doi.org/10.1007/0-387-71599-1.
50. Saini, S., Sood, A., Kotwal, N., Kotwal, A. and Gupta, T. (2021). A pilot study comparing hearing thresholds of soldiers at induction and after completion of one year in high altitude area. Medical Journal Armed Forces India. Accessed at https://doi.org/10.1016/j.mjafi.2021.04.010.
51. Schlittenlacher, J., Turner, R.E. and Moore, B.C.J. (2018). Audiogram estimation using Bayesian active learning. The Journal of the Acoustical Society of America 144(1), 421–30. Accessed at https://doi.org/10.1121/1.5047436.
52. Settles, B. (2012). Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 18, 1–111. Accessed at https://doi.org/10.2200/S00429ED1V01Y201207AIM018.
53. Shamseer, L., Moher, D., Clarke, M., Ghersi, D., Liberati, A., Petticrew, M., Shekelle, P. et al. (2015). Preferred reporting items for systematic review and meta-analysis protocols (PRISMA-P) 2015: elaboration and explanation. BMJ 7647(January), 1–25. Accessed at https://doi.org/10.1136/bmj.g7647.
54. Shannon, C.E. (1948). A mathematical theory of communication. Bell System Technical Journal 27(3), 379–423. Accessed at https://doi.org/10.1002/j.1538-7305.1948.tb01338.x.
55. Simon, H.A. (2014). Why should machines learn? Machine Learning. Accessed at https://doi.org/10.1016/b978-0-08-051054-5.50006-6.
56. Sipser, M. (2013). Introduction to the Theory of Computation, vol. 27. Accessed at https://doi.org/10.1145/230514.571645.
57. Skolidis, G. (2012). Transfer learning with Gaussian Processes. Thesis, University of Edinburgh, pp. 153–60. Accessed at http://dx.doi.org/10.1016/j.knosys.2015.01.010.
58. Song, X.D., Garnett, R. and Barbour, D.L. (2017). Psychometric function estimation by probabilistic classification. The Journal of the Acoustical Society of America 141, 2513–25. Accessed at https://doi.org/10.1121/1.4979594.
59. Song, X.D., Sukesan, K.A. and Barbour, D.L. (2018). Bayesian active probabilistic classification for psychometric field estimation. Attention, Perception, and Psychophysics 80(3), 798–812. Accessed at https://doi.org/10.3758/s13414-017-1460-0.
60. Song, X.D., Wallace, B.M., Gardner, J.R., Ledbetter, N.M., Weinberger, K.Q. and Barbour, D.L. (2015). Fast, continuous audiogram estimation using machine learning. Ear and Hearing 36(6), e326–e335. Accessed at https://doi.org/10.1097/AUD.0000000000000186.
61. Stevens, S.S. (1957). On the psychophysical law. Psychological Review 64, 153–81. Accessed at https://doi.org/10.1037/h0046162.
62. Tong, S. (2001). Active learning: Theory and applications. PhD thesis. Accessed at https://ai.stanford.edu/~koller/Papers/Tong:2001.pdf.gz.
63. Wang, Q., Qian, M., Yang, L., Shi, J., Hong, Y., Han, K., Li, C. et al. (2021). Audiometric phenotypes of noise-induced hearing loss by data-driven cluster analysis and their relevant characteristics. Frontiers in Medicine 8. Accessed at https://doi.org/10.3389/fmed.2021.662045.
64. Wasmann, J.W.A., Lanting, C.P., Huinck, W.J., Mylanus, E.A., van der Laak, J.W., Govaerts, P.J., Swanepoel, D.W. et al. (2021). Computational audiology: New approaches to advance hearing health care in the digital age. Ear and Hearing 42(6), 1499–507. Accessed at https://doi.org/10.1097/AUD.0000000000001041.
65. Wilkinson, R. (2020). Introduction to Gaussian processes – YouTube. Accessed at http://gpss.cc/gpss19/.
66. Williams, C.K. and Barber, D. (1998). Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence 20(12), 1342–51. Accessed at https://doi.org/10.1109/34.735807.
67. Yang, Q., Zhang, Y., Dai, W. and Pan, S. (2020). Transfer Learning. Cambridge: Cambridge University Press.
68. Yang, Q. and Lee, Y. (2019). An investigation of enablers and inhibitors of crowdfunding adoption: Empirical evidence from startups in China. Human Factors and Ergonomics in Manufacturing & Service Industries 29(1).
69. Zwislocki, J.J. (2009). Sensory Neuroscience: Four Laws of Psychophysics. Springer.
10. Predictive analytics for machine learning and deep learning

Tahajjat Begum
WHAT IS PREDICTIVE ANALYTICS?

Today we have an influx of information surfacing around us in different forms and presentations. We can even term data one of the most valuable assets for businesses in making strategic decisions. Whether companies need to conduct customer surveys, prepare quarterly financial reports or analyze customer buying behavior, analytics is critical for future business success. Every company tries to achieve a competitive advantage to acquire more market share, and the need to adopt new analytic techniques is therefore great. Companies now want to detect fraud, identify future business risks, find new customers, improve business operations or even predict customers' buying behavior (McCarthy et al., 2022). We are now in the big-data era, where charts and graphs are not the only means of visual representation; we are in an era where foresight and analytics play a significant role in business.

The need for predictive analytics emerged with the demand to forecast future outcomes using advanced analytical tools. Predictive analytics requires us to analyze data; therefore, techniques like data mining, statistics, modeling, machine learning, deep learning and even artificial intelligence have become an integral part of predictive analytics. It is therefore imperative to understand data science first in order to understand predictive analytics. According to Waller and Fawcett (2013), data science solves problems using data by predicting outcomes. They also explained that data science requires both broad and narrow sets of analytical skills. Predictive analytics is a subset of data science, and without understanding data science and big data, it won't be easy to understand the true nature of predictive analytics.

Businesses are now using historical data to decipher patterns, identify risks and grasp opportunities using machine learning, deep learning, and statistical modeling techniques. Every industry, from entertainment and health care to retail, utilities and even the public sector, is now turning to predictive analytics (Maciejewski et al., 2011), not only to identify business risks or optimize business operations but also to increase revenues and profits and achieve competitive differentiation from competitors. In simple words, you can use predictive analytics to anticipate the future performance of your business and support strategic decision-making.

Predictive analytics is a forward-looking approach that tries to answer the question, "what is the future outcome?" (Barga et al., 2015). To answer this question, predictive analytics uses historical data to identify trends and patterns that inform strategic decision-making. Predictive analytics refers not only to finding patterns and information in data but also to leveraging the findings to anticipate the future by making predictions. It is also essential to understand the difference between machine learning, deep learning, and predictive analytics: you can create different predictive models using machine learning and deep learning techniques. Predictive analytics, data science, and big data are closely associated, as we have a deluge of data around us. Companies can gather data from
transactional databases, weblogs, images, videos, sensors, social media, and other sources; therefore, we will find big data at the core of predictive analytics. With the help of all these data, companies can build predictive models using machine learning and deep learning algorithms. Using predictive analytics, companies can better understand their customers, recognize untapped opportunities, anticipate future threats, or even optimize their operational processes (Office of the Privacy Commissioner of Canada, 2012).
WHAT IS DATA SCIENCE?

Information is the new money in this era of digitalization, as we can now collect and store data using technology. Data science is not only a new buzzword for companies; professionals are also diving into the field to work with data. Data can provide valuable insights that businesses seek to use for more informed and data-driven decision-making. In addition, companies are now using data science to improve their services and products to achieve competitive advantages.

Data science does not have a standardized definition; however, everyone agrees that it incorporates many fields, such as data mining, machine learning, data visualization, statistics, mathematics, and computer science, to extract knowledge from data. According to the tech giant IBM (IBM Cloud Education, 2020), data science is a method of data extraction that deciphers insight by combining scientific methods, mathematics, statistics, programming, artificial intelligence, and analytics. Generally, data science is the amalgamation of qualitative and quantitative approaches that solve real-world problems and help predict trends and patterns (Waller and Fawcett, 2013). Farias and Shanthikhumar (2021) believe that data science creates research capabilities that increase the human skills and capacity to solve business problems with the help of large datasets and by leveraging technology. Likewise, Kelleher and Tierney (2018) mentioned in their paper how we can extract patterns from large datasets using machine learning and data mining techniques. From the name itself, it is apparent that data is at the core of the data science field. Ultimately, under the domain of data science we try to discover, extract, compile, process, analyze, interpret and lastly visualize data. The new buzzword "Big Data" is also an application of data science that deals with vast amounts of data. Nowadays, data is everywhere; for instance, imagine the daily online content people share using social media, search engines, mobile devices, sensors, geolocation, software applications, radio frequencies, and so on.

Figure 10.1 represents the data science life cycle. The data science life cycle is a continuous process that includes gathering data from different sources, formatting data, storing it in a data warehouse for future analysis, examining data, analyzing data to gather insights and, lastly, presenting insights. It is apparent from the data science life cycle that machine learning and deep learning are not only a part of the data science process but also vital components for developing predictions.
Figure 10.1  Life cycle of data science (Source: IBM Cloud Education, 2020)

WHAT IS MACHINE LEARNING?

Before explaining machine learning, it is essential to describe artificial intelligence (AI). However, according to Abbass (2021), defining AI is not an easy task because AI is omnipresent and still evolving, and many organizations are at the inception stage of adopting AI. The Father of AI, John McCarthy, provided the best definition: he established AI as the science and engineering of creating intelligent machines (McCarthy, 2004). Likewise, Nilsson (2010) defined AI as an activity that makes machines smart enough to attain foresight. To develop such prudence in a machine, AI requires the machine to improve, and therefore the need for machine learning emerges. From the term "Machine Learning" we can assume that the machine is learning; in simple words, the machine needs to improve its performance. Zhou (2021) described in the book Machine Learning that machine learning (ML) is the core of AI, or we could say machine learning is a subset of AI. He also mentioned that machine learning uses computational power to expand the performance of machines by utilizing past experiences. Likewise, tech giant IBM labeled machine learning as a division of both artificial intelligence and computer science, which uses mathematics and data to replicate the learning patterns of humans and eventually tries to improve its performance level. According to Gopal (2019), machine learning is a modern computer-based application of mathematics and statistics.

In simple words, machine learning is a ubiquitous and powerful artificial intelligence approach that can transform businesses. For instance, Netflix uses machine learning to understand user preferences to predict movies for users and suggest titles accordingly. All industries, manufacturing or banking, health care or pharmaceuticals, retail or energy, are now using ML technology to achieve competitive advantage. According to Zhou et al. (2017), ML has created a widespread buzz in applications like computer vision, natural language processing, IoT technology, and so on. They determined that big data fuels ML algorithms to attain deeper granularity and diversity for pattern and trend recognition, and established that the inception of big data prompted ML technology to flourish. For example, you can now predict a financial crisis or GDP from historical data by employing a machine learning approach. Health care is currently using machine learning to develop medicines or make diagnoses. Almost all industries are now adopting the machine learning approach.
During the COVID-19 era, researchers used machine learning extensively to help develop COVID-19 vaccines (Vaishya et al., 2020). A machine learning model uses algorithms trained on samples of historical data to make the predictions needed for a task. Big data and ML are interrelated, and big data also helps ML solve the scalability problem. Vaishya et al. (2020) note that ML tries to decipher patterns and structures to forecast future outcomes with the help of large datasets, because ML needs algorithms, large volumes of data, and computing power to achieve strong prediction capability; big data analytics is therefore an imperative part of the machine learning technique. Similarly, Obermeyer and Emanuel (2016) explained that machine learning algorithms require millions of data points to achieve high-level prediction capability. Because these algorithms are data-hungry and perform better with bigger datasets, big data analytics is vital for ML to achieve better efficiency and accuracy in predictions. Today, accumulating a large amount of data for predictive models is not difficult, as we are swimming in data: data can be collected from trillions of web pages, videos are uploaded to YouTube and social media content is created every minute, and customers at Walmart alone contribute millions of transactions every day. All of this content and customer data generates a deluge of data that we can consider big data. Murphy (2012) divided machine learning techniques primarily into two categories: the supervised or predictive approach, and the unsupervised or descriptive approach. We will discuss these branches of machine learning later in this chapter. Before discussing how machine learning is used for predictive analytics, we must also discuss deep learning, as without machine learning and deep learning, predictive analytics might not exist. The machine learning process usually starts with gathering data from different sources and then splitting the dataset into a training set and a testing set. First, we train the model on the training set, and later we use the testing set to generate predictions. Both datasets are inputs to the model, and the prediction is the final output (see Figure 10.2).
Figure 10.2 General machine learning process for prediction
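To make this train-and-test flow concrete, the following minimal Python sketch uses scikit-learn with a synthetic dataset; the chapter itself does not prescribe any tooling, so the library, the generated data, and the choice of logistic regression are illustrative assumptions only.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Gather data (here: a synthetic stand-in for data collected from real sources)
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# Split the dataset into a training set and a testing set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model on the training set
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Use the testing set to generate and evaluate predictions (the final output)
predictions = model.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, predictions))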
WHAT IS DEEP LEARNING? Previously, we learned that machine learning is a subset of AI. Likewise, deep learning is a subset of machine learning, and it tries to replicate the workings of the human brain (IBM Cloud Education, 2020). Brownlee (2019a) also describes deep learning as an extended version of machine learning that mimics brain functionality. Deep learning and machine learning follow the same overall process; however, deep learning algorithms use neural networks that repeat a task many times to reach the desired outcome. The deep learning technique has improved machine performance compared with the traditional machine learning approach, so we can say that, to enhance performance and scalability, the machine learning domain developed deep learning as an extension built on neural networks. Deep learning is becoming the domain leader because of its fast-paced information processing capabilities (Pouyanfar et al., 2019). Today, we can build large neural networks because we have a staggering amount of data and better computational power. A deep learning algorithm uses several layers to learn from previous experience and improves its results through repetition until it reaches the final outcome (Marr, 2018). According to Shi et al. (2017), deep network architectures are flourishing because digitization and high-performance computing have overcome significant technical constraints on executing the deep learning approach; they were able to address the overfitting issue by using huge datasets and state-of-the-art techniques. In other words, we now have enough data, high-performance computing resources, and many efficient algorithms to build deep neural networks. With the advent of deep learning, we can also address feature engineering problems that machine learning initially struggled to handle with precision: deep learning uses multiple layers that continuously extract features from input data, which made it possible to learn feature values for image and voice recognition (Sugomori et al., 2017). Likewise, Herrera-Flores et al. (2019) note that the last decade has seen revolutionary advancements in image and speech recognition, as big IT companies like Google, Facebook, Amazon, Apple and Tesla use deep learning to develop state-of-the-art technology, software, and hardware. In short, deep learning is an extended version of machine learning that uses complex neural networks to identify features for prediction, while machine learning itself is the subset of AI that tries to make the machine an intelligent predictor (see Figure 10.3). In 1962, John W. Tukey, an American mathematician, was the first to express the data science vision. Nearly two decades before the first personal computers were available, he predicted the emergence of a new field in his now famous article "The future of data analysis". When it came to the emerging field of "data science", Tukey was ahead of his time, but he wasn't the only one. Data stockpiles have grown enormously since the turn of the century, owing to advances in processing and storage that are both efficient and cost-effective at scale.
The ability to collect, process, analyze and present data and information in "real time" provides a once-in-a-lifetime opportunity to engage in a new type of knowledge discovery (Liguori, n.d.). In addition, data scientists expect excellent performance from a vast array of technologies so that jobs run and data is processed in seconds, even at massive volumes. Arthur Samuel, an IBM computer scientist and a pioneer in artificial intelligence and computer gaming, coined the term "machine learning" while working on computer software for the board game of checkers, a program he began in 1952.
Figure 10.3 Relationship of AI, machine learning and deep learning
Because it used a minimax algorithm to evaluate moves and develop winning strategies, the program improved the more games it played. According to Najafzadeh and Ghanbari (2020), machine learning is required when humans cannot make sense of the data because of its vast volume, when an expert is unavailable, when a problem changes over time, and when conditions such as routing in computer networks must be handled adaptively. Parallel Distributed Processing, a two-volume work by David Rumelhart, James McClelland and the PDP Research Group, was published in 1986 and expanded the use of neural network models for machine learning. In 1997, Deep Blue became the first computer chess system to defeat a reigning world champion. Deep Blue took advantage of the improved computer power available in the 1990s to perform large-scale searches of potential moves – it was said to be capable of processing over 200 million moves per second – before selecting the best one. In 2011, IBM's Watson defeated two champions of the quiz show Jeopardy! (Cohen, 2021). In 2016, AlphaGo, a system designed by Google DeepMind researchers to play the ancient Chinese game of Go, defeated Lee Sedol, who had been the world's top Go player for almost a decade, in four games out of five. Deep learning, as noted above, is a subset of machine learning, and the first deep learning models appeared in 1965. Their origins can be traced back to 1943, when Warren McCulloch and Walter Pitts developed a computer model based on the neural networks of the human brain, using a combination of mathematics and algorithms they called threshold logic to simulate the thought process. Since then, deep learning has progressed steadily, with two crucial pauses in its development.
Henry J. Kelley is credited with developing the fundamentals of a continuous backpropagation model in 1960, and in 1962 Stuart Dreyfus devised a simplified version based solely on the chain rule. Although backpropagation – which uses errors to train deep learning models – was thus proposed in the early 1960s, a practical form emerged in the 1970s and gained attention after Seppo Linnainmaa published his master's thesis, which included FORTRAN code. The idea was not applied to neural networks until 1985, when Hinton and Rumelhart demonstrated backpropagation in a neural network that could produce interesting distributed representations. Yann LeCun described how he used convolutional neural networks and backpropagation to read handwritten digits at Bell Labs in 1989, the first practical demonstration of backpropagation (Some, 2018); a mix of convolutional neural networks and a backpropagation mechanism was used to read the numbers on handwritten checks. The future evolution of artificial intelligence will depend heavily on deep learning, which is still in its infancy and in constant need of new and innovative ideas to progress. Predictive analytics has been around for almost 75 years, although it has only recently become popular. It began in the 1940s, when governments first started to use computers, and grew into an established practice as businesses recognized the need for it (Bellapu, 2021). In the 1950s, the history of predictive analytics shifted once more as the technique spread across a wide range of industries: computerized predictive modeling became part of the normal operations of weather forecasters, the transportation and shipping industries, and government research agencies. As more data has become available, organizations have begun to use predictive analytics to increase profits and strengthen their competitive advantage. Its application has been boosted by the constant growth of stored data and a rising interest in leveraging data for business intelligence, and it is now used across many businesses and functional areas for insurance underwriting, fraud detection, risk management, direct marketing, upsell and cross-sell, customer retention, collections, and more.
DIFFERENT MACHINE LEARNING AND DEEP LEARNING APPROACHES FOR PREDICTIONS Traditionally, machine learning techniques have been divided into supervised learning, unsupervised learning, and reinforcement learning, based on the kind of learning feedback available. Having discussed data science, artificial intelligence, machine learning, and deep learning, we now need to dive deeper into these different approaches, because understanding them is necessary for selecting the correct machine learning algorithm to solve a given problem. Supervised Learning The word "supervised" refers to the labeled data used during model training. Supervised learning involves learning features from an input–output relationship with the help of supervised learning algorithms (Russell and Norvig, 1995). Basically, a supervised model tries to replicate the human learning process: the machine attempts to learn from examples. It is built on a mathematical model that takes labeled input, tries to learn from it, and later presents the desired output.
We give the machine labeled data, referred to as training data, to develop a model using learning algorithms. Once the trained model shows a high degree of accuracy, we use an unlabeled testing dataset to evaluate how well it identifies the relevant features. Supervised learning is essentially about creating a function that learns to solve a problem. For instance, we can identify spam emails using a supervised machine learning technique: we label emails as spam or not spam, assign features that separate them, and then develop a model on the labeled dataset so the algorithm can identify spam emails. The core of machine learning is the algorithm that uncovers the knowledge embedded in the data; these algorithms provide the instructions by which machines learn patterns and trends to yield the best prediction. Classification and regression are the most essential supervised learning tasks. Classification algorithms assign people or things to groups and are mostly used for predictions: the algorithm categorizes an observation based on the values of its variables, drawing a conclusion about which group the outcome belongs to. For instance, predicting whether a credit card transaction is fraudulent or valid is a classification problem. Regression algorithms predict continuous numerical variables, and a regression model must capture the relationship between variables in order to predict. For example, predicting house prices or customer expenditure based on historical data is a classic regression problem, and a related model can assess whether a person is likely to be granted a mortgage based on variables associated with the mortgage decision, such as job, income and credit record. Now, let's discuss some of the most widely used and standard supervised learning algorithms. Linear Regression Linear regression is the most widely used algorithm in both statistics and machine learning. A key goal of predictive modeling is minimizing error and making predictions as accurate as possible. Linear regression tries to decipher the relationship between variables, where one variable acts as the predictor and the other as the dependent variable; it seeks a linear relationship between the dependent and independent variables. For prediction or forecasting, linear regression fits a model using labeled data that contains a response and its defining features, and the fitted model can then predict the response for unlabeled data. The model assumes the relationship between variables is linear and makes its decision on that basis. For example, banks use the capital asset pricing model, which is primarily based on linear regression, to analyze investment risk, and linear regression is also used to predict consumption, spending, investment, imports, labor demand, and so on. In essence, it tries to locate the best-fitting line through continuous variables such as weight and height. Logistic Regression Logistic regression deals with categorical dependent variables that are binary or have a limited number of possible values.
We cannot use linear regression for classification problems because its output is not bounded; logistic regression came into the picture to solve this problem.
The logistic model uses probability to describe the occurrence of an event, such as whether someone is dead or alive, or has failed or passed a course. Logistic regression produces a value between 0 and 1 that represents the predicted probability of the event. From a statistical point of view, it models a binary dependent variable using a logistic function, which is not a straight line. For instance, to predict whether students pass or fail a course, we assign (1) to pass and (0) to fail, set a threshold for the prediction, and predict a pass whenever the predicted value is above the threshold and a fail when it is below. Although the two algorithms differ, linear regression can be turned into logistic regression by passing its output through the logistic function. Nearest Neighbor The K-nearest neighbors (KNN) algorithm classifies a data point based on the behavior of its neighbors. The algorithm captures the similarity between new data and the available classes; rather than learning a model during training, it simply stores the data. When new input arrives, it is categorized according to its similarity to the stored data. In Figure 10.4, balls are assigned specific values and classified into Class A and Class B, and the new red ball is matched against the characteristics of Class A or Class B to predict which class it belongs to.
Figure 10.4 Three learning models for algorithms (Source: Jones, 2017)
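As a hedged sketch of the regression and classification ideas just described, the Python code below fits a logistic regression to an illustrative pass/fail problem and applies a 0.5 probability threshold; the invented study-hours data, the library (scikit-learn) and the threshold value are assumptions for demonstration, not anything prescribed by the chapter.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative data: hours studied (feature) and pass (1) / fail (0) outcomes
hours = np.array([[1], [2], [3], [4], [5], [6], [7], [8]])
passed = np.array([0, 0, 0, 0, 1, 1, 1, 1])

model = LogisticRegression()
model.fit(hours, passed)

# Predicted probability of passing for a student who studied 4.5 hours
prob = model.predict_proba([[4.5]])[0, 1]
threshold = 0.5  # assumed cut-off: above it we predict "pass"
print("P(pass) =", round(float(prob), 2), "->", "pass" if prob >= threshold else "fail")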
Support Vector Machines While the KNN algorithm assigns a class to a data point according to the characteristics of its neighbors (grouped data points), Support Vector Machines (SVM) classify data points using a dimensional space and hyperplanes (decision boundaries). SVM assigns a data point to a class based on which side of the hyperplane it falls closest to, and the dimension of the hyperplane depends on the number of features.
SVM represents each observation as a point in that space, and the task is to identify the right hyperplane to separate the classes. Neural Networks The neural network is the core of the deep learning concept. A neural network tries to replicate the operation of the human brain in order to understand the input data and predict the final output. The network is built from several layers, which is why a network with many layers is known as a deep neural network. Its basic unit is a mathematical function called a neuron (or node, or perceptron) that processes information and sends signals from one layer to the next. The layers of nodes comprise an input layer, one or more hidden layers, and an output layer (see Figure 10.5). Each node uses weights and a threshold value to decide whether to pass its output on: if the output is above the threshold, the signal is sent to the next layer. The neural network uses training data to learn and improve its accuracy so that the outcome can be predicted reliably. One interesting fact about neural networks is that they can be used for both supervised and unsupervised learning.
Figure 10.5 K nearest neighbour (Source: José, 2018)
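The layered network described above can be sketched, under assumptions, with scikit-learn's multi-layer perceptron; the layer sizes, the iris sample dataset and the library choice are illustrative rather than anything specified in the chapter.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Two hidden layers sit between the input layer and the output layer
net = MLPClassifier(hidden_layer_sizes=(16, 8), max_iter=2000, random_state=0)
net.fit(X_train, y_train)
print("Test accuracy:", net.score(X_test, y_test))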
Decision Tree Decision tree learning is a predictive modeling approach that starts with observations about an item and ends at a target value. It creates a tree-like structure for reaching a conclusion, in which leaves represent class labels and branches represent the outcomes of tests. Basically, a decision tree is a type of flow chart that illustrates the decision-making process, using if–else conditions to classify cases; its root, branches, and leaf nodes lead you to the final outcome. The decision tree approach lets us analyze possible consequences because it lays out the available options for a problem. According to Rokach and Maimon (2007), a decision tree creates a hierarchy of decisions and possible outcomes.
Decision trees can be used for both regression and classification problems.
Figure 10.6 Neural network (Source: IBM Cloud, 2020)
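A hedged sketch of the decision tree approach described above, again assuming scikit-learn and a standard sample dataset chosen purely for illustration:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Fit a shallow tree so the if-else structure stays readable
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Print the learned rules: each branch is an if-else test on a feature,
# and each leaf carries the predicted class label
print(export_text(tree))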
Unsupervised Learning As with supervised learning, the name makes its meaning clear: in unsupervised learning the data are raw and unlabeled. Basically, we try to identify features, hidden patterns, and trends, or draw inferences, using a dataset that carries no labels from which the machine could learn an input–output relationship. Unlike supervised learning, unsupervised learning relies only on the input data, as there is no outcome variable for the machine to learn from (Brownlee, 2019b). Unsupervised learning is a self-observation method that tries to understand patterns based on the probability density of the inputs (Hinton and Sejnowski, 1999). More precisely, unsupervised algorithms learn the inherent structure of the features and uncover unknown insights hidden in the data. Because they seek out similarities and differences, they are well suited to data analysis, customer segmentation, image recognition, and so on, offering an exploratory path through the data. Clustering, association rules and dimensionality reduction are the main techniques used to develop unsupervised learning models. Clustering algorithms group raw, unlabeled data and attempt to find patterns or structure inside it: they separate data points by similar traits and group them into clusters. Recommender systems are a classic application; a YouTube song suggestion or an Amazon product suggestion are familiar examples of the clustering technique. Next come association rules, which discover relationships between variables; they are used to anticipate customers' buying behavior and to conduct market basket analysis. For instance, supermarkets use association rules to encourage people to buy products that are used together – you will find eggs and milk next to each other because people usually buy them together. Supermarkets use association rules for marketing purposes.
This helps companies find relationships between products that customers regularly buy together. K-Means Clustering The goal of the K-means algorithm is straightforward: to create groups of similar data points in order to discover patterns and trends. K-means locates K center points, known as centroids, and then assigns each data point to the closest cluster; its main objective is to minimize the total distance between data points and the centroids of their clusters. The K in K-means represents the number of clusters, and each data point usually belongs to only one cluster. The procedure involves three steps: initialize the centroids, measure the distance of each data point from each centroid, and assign the data point to the group with the closest centroid. K-means clustering can be used, for example, to identify locations where criminal activity is higher, or to cluster customers based on their purchase history or interests. In Figure 10.7, two center points have been selected, and new data points are grouped into either cluster 1 or cluster 2 according to the closest centroid.
Figure 10.7 K-means clustering example (Source: Raghupathi, 2018)
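The centroid-based grouping just described might look as follows in a minimal scikit-learn sketch; the synthetic data and the choice of K = 2 are assumptions made only for illustration.

from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Synthetic, unlabeled data points
X, _ = make_blobs(n_samples=300, centers=2, random_state=42)

# K = 2: find two centroids and assign each point to the nearest one
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print("Centroids:")
print(kmeans.cluster_centers_)
print("First ten cluster assignments:", labels[:10])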
A Priori Algorithm We have already discussed association rules; the a priori algorithm is used for association rule learning. It relies on three metrics to find relationships between products. First, it measures how many times an item or combination of items appears in the dataset – the frequency, or support, of the items. Then it calculates the likelihood that customers who buy one item also buy the rest of the combination – the confidence, or probability of buying.
Lastly, it measures the strength of the product association (the lift). Retailers use association rule mining, also called market basket analysis, to uncover associations between products that can be sold as a combination. In other words, retailers place related products around each other to promote them or tempt customers into buying them; if bread, butter and eggs are placed near each other, a customer might pick up all three items while browsing for any one of them. Dimensionality Reduction The name itself explains the main purpose of dimensionality reduction: reducing the number of dimensions. Too many input variables can cause overfitting in predictive models and produce inaccurate results, and models often perform better when the number of input variables is reduced. The technique converts a higher-dimensional dataset into a lower-dimensional one while trying to preserve as much of the original information as possible. Basically, dimensionality reduction removes features from the dataset to cut redundancy and complexity before the predictive model is trained. For instance, if your dataset contains many irrelevant features that contribute only 2–5 percent towards locating the target, you can remove them to build a simpler, less complex model that gives better predictions. Principal component analysis is one of the most widely used standard dimensionality reduction algorithms and is applied extensively to minimize model complexity for prediction (a short sketch appears at the end of this section). Reinforcement Learning Reinforcement learning focuses on reward- and punishment-based learning. The idea is to discover a model with the help of an intelligent agent that tries to accumulate the maximum reward on its way to the main target. Basically, reinforcement learning creates a game-like environment in which the agent uses trial and error to find a solution. It is often treated as a third branch of machine learning, although some describe it as a broader artificial intelligence technique used for particular classes of problems; either way, it is all about creating a learning process in an environment through a reward system. The agents, whose behavior embodies the learning, interact with the environment and use observation to work out the right pathway towards the goal, much like exploring a new and unknown environment and learning the right course of action through reward and punishment. This technique produces a model that makes decisions sequentially. It can be hard to separate machine learning, deep learning, and reinforcement learning cleanly; all three are used to build prediction models.
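The promised dimensionality reduction sketch follows; it applies principal component analysis to a wide synthetic dataset. The chapter does not mandate any particular library, so scikit-learn, the synthetic data and the 95 percent variance target are assumptions.

from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# 50 features, only a handful of which carry real information
X, y = make_classification(n_samples=500, n_features=50, n_informative=5, random_state=1)

# Standardize, then keep enough components to explain 95% of the variance
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)

print("Original features:", X.shape[1], "-> reduced components:", X_reduced.shape[1])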
USE CASES OF PREDICTIVE ANALYTICS The use cases for predictive analytics are revolutionary, and in the future most companies are likely to use machine learning and deep learning for prediction. For any business, losing a customer is extremely expensive, and companies fight hard to retain their customer base. With the help of predictive analytics, companies can detect unhappy customers early, prevent churn, and take the necessary steps to retain them.
Companies now use sentiment analysis of social media reviews, posts, and feedback to measure their reputation. Predictive analytics plays a pivotal role in the retail industry's ability to predict customers' buying behavior: based on the prediction, a company can set strategies for its marketing and promotional campaigns, and algorithms can analyze market trends and buying patterns to provide recommendations. It is also now possible to predict product demand and determine a company's inventory needs. With the help of machine learning and deep learning, predictive analytics can provide efficient logistics suggestions; logistics companies can determine the fastest and most efficient routes, analyze driving behavior, reduce costs, and monitor traffic conditions. Likewise, banks and financial services use historical data extensively to identify fraud and threats, guarding against unwanted network intrusion and risky customers. Xin et al. (2018) explained how machine learning and deep learning methods can strengthen cybersecurity by detecting network intrusions; their paper covered many algorithms, including support vector machines, K-nearest neighbors, decision trees and neural networks, and introduced recent ML and DL applications for intrusion detection. According to Shah et al. (2018), the health care system has seen rapid growth in health care data, creating big data platforms, with machine learning approaches promising better prediction possibilities. They believe that big data and predictive analytics together can provide better and more efficient medical care, mainly in the field of image analytics. During the COVID-19 pandemic, for instance, hospitals worldwide struggled with the influx of patients. Predictive analytics could help hospitals identify likely admissions earlier: if a hospital can predict possible admissions, it can act to minimize them by reaching out to those potential patients and offering health advice to prevent them from falling sick. Using patients' historical data, machine learning and deep learning algorithms can predict hospital admission. According to Bresnick (2018), the University of Pennsylvania uses a predictive analytics tool based on machine learning to detect severe sepsis cases 12 hours earlier, and a Duke University study showed that machine learning models can forecast no-show patients, freeing up appointments for others. Arunachalam et al. (2019) provide an excellent example of detecting tumors from images using machine learning and deep learning models. Predictive analytics capabilities in the health care sector are profound, and big data is opening up significant opportunities to create more accurate predictive models. Manufacturing companies now combine predictive models with the Internet of Things (IoT) to receive alerts about equipment maintenance and possible machinery breakdown. Predictive analytics is also becoming popular in the sports industry: big teams use it to measure players' performance before signing big-budget contracts, and players' statistics and historical data can be used to predict their future value and performance.
García and Huerta (2020) describe how companies can reduce their costs with ML and DL applications, which can help them achieve a competitive advantage; they give the example of a university in Mexico that worked with a private company, leveraging ML and DL technology to implement intelligent traffic lights and recognize engine failure. Predictive analytics can also help supply chain and manufacturing operations run smoothly and avoid production disruption. The IoT allows real-time telemetry about the production process to be recorded, and with this IoT data companies can create models that alert them to possible machine failure or the need for machinery maintenance.
Similarly, if the airline industry can predict mechanical disruption earlier, it can save a vast amount of time and money by avoiding flight delays and cancellations. According to a Harvard Business Review article, a Microsoft team created a tool to predict flight delays or cancellations based on flight route and maintenance history; this type of model can certainly help airlines save money and time and spare passengers unwanted hassle (LaRiviere et al., 2016). Weather forecasts used to be unreliable, but we now get quite accurate predictions thanks to predictive analytics and machine learning, which matters for day-to-day life because we plan around the forecast to avoid unwanted situations. Zhang et al. (2021) also explain in their book how machine learning and deep learning techniques can be used to predict the occurrence of geohazards or the failure of a structure, and ML and DL algorithms can help detect vibration-based structural damage in civil engineering (Avci et al., 2021). All these examples provide ample evidence that predictive analytics, supported by ML and DL applications, can change the way almost every industry works.
CONCLUSION Predictive analytics is not a new topic; however, its popularity and usage have surged thanks to big data, machine learning, and deep learning technology. Businesses and investors are keen to use predictive analytics to achieve competitive advantage by reducing risk, optimizing operational efficiency, or finding ways to increase revenue, and companies are now developing predictive models using machine learning and deep learning algorithms. Predictive models are not limited to one industry: every sector, from sports and entertainment to health care, banking and government, is trying to leverage predictive analytics for better decision-making and strategy. Understanding predictive analytics requires understanding artificial intelligence and data science, since machine learning and deep learning techniques are at the core of building predictive models, and it is equally necessary to understand the different types of learning – supervised, unsupervised, and reinforcement learning – because algorithms are the heart of any predictive model. Even though machine learning and deep learning are still evolving and require further study and research, the benefits of applying these two techniques to forecasting are extraordinary. Predictive analytics is an emerging business practice that supports industry growth, providing data-driven insights that improve efficiency, profits and the customer base, save time, and identify issues before they become problems. In short, the benefit and importance of predictive analytics for every sector are enormous.
REFERENCES
Abbass, H. (2021). What is Artificial Intelligence? IEEE Transactions on Artificial Intelligence, 2(2), 94–5.
Arunachalam, H.B., Mishra, R., Daescu, O., Cederberg, K., Rakheja, D., Sengupta, A., Leonard, D. et al. (2019). Viable and necrotic tumor assessment from whole slide images of osteosarcoma using machine-learning and deep-learning models. PloS One, 14(4), e0210706. Retrieved from doi:http://dx.doi.org/10.1371/journal.pone.0210706.
Avci, O., Abdeljaber, O., Kiranyaz, S., Hussein, M., Gabbouj, M. and Inman, D.J. (2021). A review of vibration-based damage detection in civil structures: From traditional methods to Machine Learning and Deep Learning applications. Mechanical Systems and Signal Processing, 147, 107077.
Barga, R., Fontama, V. and Tok, W.H. (2015). Predictive Analytics with Microsoft Azure Machine Learning. Berkeley: Apress.
Bellapu, A. (2021). Evolution of analytics over the years. Analytics Insight, 17 January. Retrieved from https://www.analyticsinsight.net/evolution-of-analytics-over-the-years/.
Bresnick, J. (2018). 10 high-value use cases for predictive analytics in healthcare. Health IT Analytics, 4 September. Retrieved from https://healthitanalytics.com/news/10-high-value-use-cases-for-predictive-analytics-in-healthcare.
Brownlee, J. (2019a). What is Deep Learning? Machine Learning Mastery, 16 August. Retrieved from https://machinelearningmastery.com/what-is-deep-learning/.
Brownlee, J. (2019b). 14 different types of learning in machine learning. Machine Learning Mastery, 11 November. Retrieved from https://machinelearningmastery.com/types-of-learning-in-machine-learning/.
Cohen, S. (2021). The evolution of machine learning: Past, present, and future. Artificial Intelligence and Deep Learning in Pathology, 1–12. Retrieved from doi:https://doi.org/10.1016/B978-0-323-67538-3.00001-4.
Farias, F.V. and Shanthikumar, G.J. (2021). Editorial statement – data science. Management Science, 67(11), v.
García, H.C. and Huerta, R.E. (2020). Machine learning and deep learning patentable developments and applications for cost reduction in business and industrial organizations. International Journal of Management and Information Technology, 9–18.
Gopal, M. (2019). Applied Machine Learning. New Delhi: McGraw-Hill Education.
Herrera-Flores, B., Tomás, D. and Navarro-Colorado, B. (2019). A systematic review of deep learning approaches to educational data mining. Complexity. Retrieved from doi:https://doi.org/10.1155/2019/1306039.
Hinton, G. and Sejnowski, T.J. (1999). Unsupervised Learning: Foundations of Neural Computation. Cambridge, MA, USA and London, UK: MIT Press.
IBM Cloud (2020). Neural Network, 17 August. Retrieved from https://www.ibm.com/cloud/learn/neural-networks.
IBM Cloud Education (2020). Data Science, 1 May. Retrieved from https://www.ibm.com/cloud/learn/deep-learning.
Jones, M.T. (2017). Models for machine learning. IBM Developer, 5 December. Retrieved from https://developer.ibm.com/articles/cc-models-machine-learning/.
José, I. (2018). KNN (K-Nearest Neighbors) #1. Towards Data Science, 8 November. Retrieved from https://towardsdatascience.com/knn-k-nearest-neighbors-1-a4707b24bd1d.
Kelleher, D.J. and Tierney, B. (2018). Data Science. Cambridge, MA: The MIT Press.
LaRiviere, J., McAfee, P., Rao, J., Narayanan, V.K. and Sun, W. (2016). Where predictive analytics is having the biggest impact. Harvard Business Review, 25 May.
Liguori, G. (n.d.). Data science history and overview. KDnuggets. Retrieved from https://www.kdnuggets.com/2020/11/data-science-history-overview.html.
Maciejewski, R., Hafen, R., Rudolph, S., Larew, S.G., Mitchell, M.A., Cleveland, W.S. and Ebert, D.S. (2011). Forecasting hotspots: A predictive analytics approach. IEEE Transactions on Visualization and Computer Graphics, 17(4), 440–53.
Marr, B. (2018). What is deep learning AI? A simple guide with 8 practical examples. Forbes, 1 October. Retrieved from https://www.forbes.com/sites/bernardmarr/2018/10/01/what-is-deep-learning-ai-a-simple-guide-with-8-practical-examples/?sh=5ad29a278d4b.
McCarthy, J. (2004). What is Artificial Intelligence? Pp. 2–14. Retrieved from http://jmc.stanford.edu/artificial-intelligence/what-is-ai/index.html.
McCarthy, R.V., McCarthy, M.M. and Ceccucci, W. (2022). Applying Predictive Analytics: Finding Value in Data. Cham: Springer. Retrieved from doi:https://doi.org/10.1007/978-3-030-83070-0.
Murphy, K.P. (2012). Machine Learning – A Probabilistic Perspective. Cambridge: MIT Press.
Najafzadeh, S. and Ghanbari, E. (2020). Machine learning. In U.N. Dulhare, K. Ahmad and K.A. Ahmad (eds), Machine Learning and Big Data – Concepts, Algorithms, Tools and Applications (pp. 155–205). Hoboken, NJ: John Wiley & Sons.
Nilsson, N.J. (2010). The Quest for Artificial Intelligence. New York: Cambridge University Press.
Obermeyer, Z. and Emanuel, J.E. (2016). Predicting the future: Big data, machine learning, and clinical medicine. The New England Journal of Medicine, 375, 1216–19.
Office of the Privacy Commissioner of Canada (2012). The age of predictive analytics: From patterns to predictions. Ottawa: Privacy Commissioner of Canada.
Pouyanfar, S., Sadiq, S., Yan, Y., Tian, H., Tao, Y., Reyes, M.P. and Shyu, M.L. et al. (2019). A survey on deep learning: Algorithms, techniques, and applications. ACM Computing Surveys, 51(5), 1–37.
Raghupathi, K. (2018). 10 interesting use cases for the K-Means algorithm. Dzone, 27 March. Retrieved from https://dzone.com/articles/10-interesting-use-cases-for-the-k-means-algorithm.
Rokach, L. and Maimon, O. (2007). Data Mining with Decision Trees: Theory and Application. Singapore: World Scientific Publishing.
Russell, S. and Norvig, P. (1995). Artificial Intelligence: A Modern Approach (3rd edn). Toronto: Prentice Hall.
Shah, N.D., Steyerberg, E.W. and Kent, D.M. (2018). Big data and predictive analytics: Recalibrating expectations. JAMA: The Journal of the American Medical Association, 27–8.
Shi, H., Xu, M. and Li, R. (2017). Deep learning for household load forecasting: A novel pooling deep RNN. IEEE Transactions on Smart Grid, 9(5), 5271–80. Retrieved from doi:10.1109/TSG.2017.2686012.
Some, K. (2018). The history, evolution and growth of deep learning. Analytics Insight, 31 October. Retrieved from https://www.analyticsinsight.net/the-history-evolution-and-growth-of-deep-learning/.
Sugomori, Y., Kaluza, B., Soa, F.M. and Souza, A.M. (2017). Deep Learning: Practical Neural Networks with Java. Mumbai: Packt Publishing.
Vaishya, R., Javaid, M., Khan, I.H. and Haleem, A. (2020). Artificial Intelligence (AI) applications for COVID-19 pandemic. Diabetes & Metabolic Syndrome: Clinical Research & Reviews, 14(4), 337–9. Retrieved from doi:https://doi.org/10.1016/j.dsx.2020.04.012.
Waller, M.A. and Fawcett, S.E. (2013). Data science, predictive analytics, and big data: A revolution that will transform supply chain design and management. Journal of Business Logistics, 34(2), 77–84.
Xin, Y., Kong, L., Liu, Z., Chen, Y., Li, Y., Zhu, H., Gao, M. et al. (2018). Machine learning and deep learning methods for cybersecurity. IEEE Access, 6, 35365–81.
Zhang, W., Zhang, Y., Gu, X., Wu, C. and Han, L. (2021). Machine Learning and Applications. Singapore: Springer. Retrieved from doi:https://doi.org/10.1007/978-981-16-6835-7_3.
Zhou, L., Wang, J., Pan, S. and Vasilakos, A.V. (2017). Machine learning on big data: Opportunities and challenges. Neurocomputing, 237, 350–61. Retrieved from doi:https://doi.org/10.1016/j.neucom.2017.01.026.
Zhou, Z.H. (2021). Machine Learning (S. Liu, trans.). Singapore: Springer.
11. Building a successful data science ecosystem using public cloud Mohammad Mahmudul Haque
1. INTRODUCTION The tools and strategies for extracting information from data are referred to as data science. Artificial intelligence (AI) is a branch of data science that deals with the science and engineering of constructing intelligent devices, particularly intelligent computer systems that can analyze data and produce results on their own. Machine learning (ML), a subset of AI, refers to the algorithms used in the data science process – specialized software programs that recognize patterns, make correlations, find anomalies, and forecast outcomes. These systems also let computers learn when they are exposed to new data and circumstances, allowing them to improve prediction accuracy as more relevant data is presented. As datasets become more complex and diverse, certain data science applications shift from machine learning to deep learning (Bisong, 2019). Data science techniques require a large amount of data, which is frequently dispersed across multiple applications and databases and managed by various business units. To get the most out of these machine learning endeavors, organizations require enterprise data management solutions to store and safeguard their data, as well as the hardware and software to run data science applications. Finally, they must deliver results in a format that non-technical people and other information systems can understand. Data preparation and machine learning model development are time-consuming operations that typically require specialist knowledge, software tools, open-source libraries, and the coordination of multiple contributors and business units. Data scientists must be able to analyze more data, create more models, and score more data at scale. For machine learning programs to yield concrete outcomes, models must be implemented in an operational setting, then evaluated and modified as conditions change. Machine learning models, and thus the judgments made using them, are only as good as the data that supports them; the more data and situations these models are exposed to, the smarter and more accurate they become. Despite this, data management remains one of the most difficult tasks for organizations managing a data science platform on premises. As a result, while machine learning is a difficult process, having the right tools makes it much easier, and a significant component of this is deploying a data science ecosystem in the public cloud. Here are a few examples of how public cloud providers like Amazon Web Services, Oracle, and Google are constantly improving their machine learning platforms to help organizations achieve their business goals more easily. Users can, for example, use a data science platform in the public cloud to train their models by adding a new library that contains the most up-to-date research from the AI/machine learning community. They can get up and running quickly by combining pre-built use cases
for time-series modeling, Bayesian modeling, deep learning, anomaly detection, supervised modeling, and unsupervised modeling. They can connect to a variety of data sources and ingest data in a variety of forms using client libraries. The Oracle Cloud Data Science platform, for example, is data source agnostic: the service includes a number of client libraries that allow users to connect to data sources on Oracle Cloud Infrastructure as well as other clouds, which means data scientists can combine data from many sources to create a large and diverse dataset for training machine learning models (Build Machine Learning Solutions with Oracle's Services and Tools, n.d.). Public cloud providers such as Oracle offer machine learning services designed for performance and scaling, and users can take advantage of auto-scaling. The autonomous database in Oracle Cloud can use up to three times more CPU and IO resources than specified by the number of OCPUs expressly assigned when auto-scaling is enabled; if a workload requires additional CPU and IO resources and auto-scaling is enabled, the database uses those resources without any operator intervention (Build Machine Learning Solutions with Oracle's Services and Tools, n.d.). The Accelerated Data Science (ADS) SDK from Oracle Cloud Infrastructure Data Science makes routine data science tasks faster, easier, and less error-prone. It has data access, profiling, and manipulation capabilities, and also includes Oracle's AutoML engine for automatic model training, as well as a simple interface for model evaluation and interpretation (Build Machine Learning Solutions with Oracle's Services and Tools, n.d.). AWS also offers SageMaker, a machine learning service that is fully managed by Amazon. Data scientists and developers can use SageMaker to construct and train machine learning models quickly and easily, then deploy them directly into a production-ready hosted environment. An organization does not need to manage servers on site because Oracle, AWS, Google and Azure provide integrated Jupyter notebook environments in the cloud for easy access to data sources for exploration and analysis. AWS SageMaker, for instance, has an Autopilot function that evaluates tabular data, assesses the machine learning problem type (e.g., regression, classification), and chooses techniques to solve the problem (e.g., XGBoost). It also generates the data transformation code required to preprocess the data before training the model. Following that, Autopilot creates a number of machine learning model candidate pipelines, each with its own set of data transformations and algorithms. Let us look at some real-life customer cases in this regard. First, we will talk about Canopy. As part of their day-to-day work, Canopy's analytics team at first manually scanned through a customer's financial documents from numerous sources. Canopy connects to about 400 custodian banks and receives data in a variety of formats, including APIs, data feeds, reporting services, and SWIFT format. Customer transaction statements, Excel files, Portable Document Format (PDF) files, and scanned images were also sent to the team, making customer data analysis a time-consuming and costly process. Canopy set out on a mission to automate the process and put the company in a better position for the future.
The team was spending hundreds of hours every week compiling financial statements, which was not sustainable for business growth. They started experimenting with open-source machine learning models on their own and, within a year and a half, were able to semi-automate the processing of their clients' financial data. Under that arrangement, however, Canopy was unable to retrain the machine learning models while they were in use, so the team had to work at weekends to avoid platform downtime – the retraining process could take up to 48 hours per week.
After moving to a cloud-based data science platform, Canopy was able to produce machine learning models and improve its OCR capabilities without having to hire more data engineers, because the solution allowed the company to consolidate the creation, training, and deployment of machine learning models on a single platform. As new data is uncovered while scanning financial documents, AWS SageMaker automatically updates the machine learning models (Canopy Case Study – Amazon Web Services (AWS), n.d.). Second, let us look at the use case for Pepperstone. When Pepperstone's data science team switched to Amazon SageMaker for model training, they had little trouble integrating it into their processes and saved a substantial amount of time by hosting, training and deploying in the public cloud from the start. The time it takes to train machine learning models on Amazon SageMaker decreased from 180 to 4.3 hours (Pepperstone Case Study – Amazon Web Services (AWS), n.d.). Next, consider the case study for Moneytree. Moneytree takes advantage of Amazon SageMaker's machine learning capabilities, which allow data scientists and developers to quickly design, create, train, and deploy high-quality machine learning models. Moneytree is also building a data lake in the public cloud, which will allow it to improve its products and optimize its operations in the future (Moneytree KK Case Study / AWS Marketplace / AWS, n.d.). Finally, let us briefly discuss the use case for OakNorth, which uses Amazon SageMaker to find new business opportunities and speed up loan closings. Using Amazon Web Services (AWS), OakNorth can quickly provision the resources needed to train and test new models, which leads to faster innovation and benefits the company's customers (OakNorth Case Study, n.d.).
2. DATA SCIENCE PROCESS
Data science depends on a broad range of software tools, algorithms, and machine learning principles to uncover business insights hidden in vast volumes of data. Data scientists make data valuable by collecting and transforming it into predictive and prescriptive insights. The data science process involves components for data ingestion as well as for serving data models; here we briefly discuss the steps for carrying out the data analytics that lead up to predictive modeling. The major steps consist of the following:
2.1 Collecting Data
Collecting and combining data can take up the majority of the data science workflow, especially when the data is dispersed across different systems. Unless you have an enterprise cloud data platform that supports all kinds of data, including semi-structured and unstructured data, this step can take time away from other important data science tasks. To get the maximum predictive potential out of ML models, the platform should provide a single location where all relevant data can be accessed instantly, along with basic information about it such as the number of variables, their data types, the number of observations, and the number/percentage of missing values (Baum, 2021).
Let us give a real-life example of the complexity of data collection. The nib Group (nib) is one of the largest health insurers in Australia and New Zealand, with over 1.4 million members; in 2015, nib introduced a cutting-edge service that allows members to submit health insurance claims using a mobile app. Members are reimbursed for valid expenses relatively promptly – usually within 24 hours – after photographing and uploading healthcare receipts directly to the app. Although this was a big step forward for members, nib's claims team spent far too much time gathering data from receipts – such as the customer number, medicine, dose, dates, and provider number – and typing it into a database (nib Group Case Study, n.d.).
2.2 Visualize and Understand
This step entails employing univariate and multivariate data visualization techniques to gain a better understanding of the properties of the data variables and their interrelationships; histograms, box and whisker plots, and correlation plots are typical examples (Bisong, 2019). A data scientist must comprehend the data before constructing a machine learning model: exploring the data, collecting statistics, checking for missing values, generating quantiles, and looking for correlations during the data analysis step. A data scientist may also occasionally want to quickly assess the data in a development environment and prototype some initial model code, perhaps just to try out a new algorithm. This is sometimes referred to as "ad hoc" exploration and prototyping, in which the data scientist queries portions of the data to gain an initial grasp of its structure and quality for the machine learning task at hand (Fregly and Barth, 2021).
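As a hedged illustration of this exploration step, the short pandas sketch below profiles a made-up claims table; the column names and values are invented for demonstration, and the chapter does not prescribe pandas or any other library.

import pandas as pd

# Illustrative data frame standing in for ingested claims records
df = pd.DataFrame({
    "claim_amount": [120.5, 80.0, None, 230.0, 95.5],
    "provider": ["A", "B", "B", "A", None],
    "days_to_settle": [3, 1, 2, 7, 2],
})

print(df.describe())       # basic statistics and quantiles
print(df.isnull().sum())   # missing values per column

# Correlations between the numeric variables
numeric = df[["claim_amount", "days_to_settle"]]
print(numeric.corr())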
2.3 Data Cleaning / Pre-processing
This procedure entails cleansing and harmonizing the data to make it suitable for modeling. Raw data, in which each row represents an observation and each column an entity, is usually unclean. The tasks in this phase of a data science project may include deleting duplicate entries, deciding on a strategy for coping with missing data, and encoding categorical features as numeric values (for example, through one-hot encoding). The phase may also involve statistically transforming the data features to normalize and/or standardize them, since features on wildly different scales can lead to poor model results by making it harder for the learning algorithm to converge to the global minimum.
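A minimal pandas sketch of these cleaning steps follows; the column names and values are invented purely for illustration, and scikit-learn's StandardScaler is one of several possible choices for standardization.

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "age": [25, 25, 40, None, 33],
    "income": [50000, 50000, 90000, 60000, None],
    "segment": ["retail", "retail", "corporate", "retail", "corporate"],
})

df = df.drop_duplicates()                             # delete duplicate entries
df["age"] = df["age"].fillna(df["age"].median())      # one strategy for missing data
df["income"] = df["income"].fillna(df["income"].median())
df = pd.get_dummies(df, columns=["segment"])          # one-hot encode the categorical feature

# Standardize numeric features so they share a comparable scale
df[["age", "income"]] = StandardScaler().fit_transform(df[["age", "income"]])
print(df.head())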
Figure 11.1 The six stages of the data science workflow
2.4 Feature Engineering
Feature engineering is the methodical process of selecting a set of features from a dataset that are useful and relevant to the learning problem; irrelevant features frequently have a negative impact on the model's performance. Common methods used during feature engineering include:
● statistical tests to pick the best features;
● recursive feature elimination (RFE), which removes extraneous features from a dataset in a recursive manner;
● feature importance derived from ensemble or tree classifiers;
● principal component analysis to determine the components that account for the variation in the dataset (Bisong, 2019).
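A hedged sketch of the first two techniques in this list, using scikit-learn on synthetic data; the dataset, the number of features kept and the estimator inside RFE are all assumptions made for illustration.

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=400, n_features=20, n_informative=4, random_state=0)

# Statistical test: keep the 5 features with the strongest univariate relationship to y
X_best = SelectKBest(score_func=f_classif, k=5).fit_transform(X, y)

# Recursive feature elimination: repeatedly drop the weakest remaining feature
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe.fit(X, y)

print("Shape after SelectKBest:", X_best.shape)
print("Feature indices kept by RFE:", [i for i, keep in enumerate(rfe.support_) if keep])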
2.5 Train, Deploy and Monitor
In this phase, the data is fed through a learning algorithm to create a prediction model. This is typically an iterative process of constant refinement, aiming for a model that better minimizes the cost function on the hold-out validation set and the test set (Bisong, 2019). When building a machine learning pipeline, users typically rely on JupyterLab and other open-source libraries to construct and train models, which is a time-consuming, iterative, and difficult procedure. Alternatively, data scientists can use the AutoML capabilities of cloud machine learning services to automate model training: feature engineering, method selection, hyperparameter tuning, data transformation, model training, and model candidate selection are all automated, which saves data scientists a great deal of time (Build Machine Learning Solutions with Oracle's Services and Tools, n.d.).
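To make the iterative train-and-refine loop concrete, the sketch below evaluates a candidate model with cross-validation before a final hold-out check; the random forest model and synthetic data are placeholders, not a recommendation from the chapter.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=15, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=7)

model = RandomForestClassifier(n_estimators=100, random_state=7)

# Cross-validation on the training data guides iterative refinement
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("Mean CV accuracy:", cv_scores.mean())

# Final check on the hold-out test set before deployment and monitoring
model.fit(X_train, y_train)
print("Hold-out accuracy:", model.score(X_test, y_test))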
3. WHAT ARE THE CURRENT CHALLENGES IN A TRADITIONAL DATA SCIENCE PLATFORM?
Machine learning practitioners generally spend weeks or months constructing, training and optimizing their models when implementing a data science platform. They prepare the data, select an appropriate framework and method, and iteratively search for the optimal algorithm for their dataset and problem type. Regrettably, there are no shortcuts for this process in a standard on-premises setup: it relies heavily on expertise, intuition and patience to conduct numerous experiments and determine the optimal hyperparameters for a given method and dataset. Only highly experienced data scientists can leverage years of expertise and intuition to choose the optimal algorithm for a given business challenge, and even they must validate that intuition with numerous training runs and model validations. Let us share a real-world scenario. Pepperstone's tech stack is built around machine learning (ML) and artificial intelligence (AI). The company has a data science team in Melbourne dedicated to constructing ML models, in addition to a 70-person IT team scattered across four continents.
across four continents. At first, its data scientists had to create their own algorithms, and it took 180 hours to train the models by comparing documents to millions of photos (Pepperstone Case Study – Amazon Web Services (AWS), n.d.). In another real-world scenario, when Canopy first started operations its data team would manually go through a customer’s financial paperwork from several sources. Canopy connects to about 400 custodian banks and receives data in a variety of formats, including application programming interfaces (APIs), data feeds, reporting services, and SWIFT format (Canopy Case Study – Amazon Web Services (AWS), n.d.). Customer transaction statements would also arrive as emails, Excel files, PDFs, and scanned images, making customer data analysis time-consuming and costly. Canopy set out on a mission to automate the process and make its business future-proof (Canopy Case Study – Amazon Web Services (AWS), n.d.). Every week the team was putting in hundreds of hours of menial labor to process financial statements, which was not sustainable for business growth. Later, they began experimenting on their own with open-source ML models, and within a year and a half they were able to semi-automate the processing of their clients’ financial data (Canopy Case Study – Amazon Web Services (AWS), n.d.). However, Canopy soon hit a snag in its automation journey: the team had to constantly upgrade its machine learning models in order to recognize and interpret new data in 20 percent of the financial records received monthly. Despite spending less time evaluating customer data, the team now had to focus on data processing and improving data quality for the ML models, which took time away from managing client investments and relationships. They could not retrain the ML models while they were in use, so they had to work at weekends to keep the platform up and running as much as possible; the retraining process could take up to 48 hours per week (Canopy Case Study – Amazon Web Services (AWS), n.d.). Typically, an organization faces the following challenges while preparing a data science ecosystem.
3.1
Impossible to Track All Data
Locating various forms of data – structured, unstructured and semi-structured – and determining who is accessing them can be challenging. As data grows, it becomes nearly impossible for an organization to identify, rationalize, and eliminate data silos, and the heterogeneity of data sources and the rate at which data changes make this even harder. The first step in reducing complexity is to centralize data in a single repository, but it is expensive to construct a compute and storage infrastructure that is both robust and resilient (Baum, 2021).
3.2
Complex to Collaborate
The next significant challenge created by data silos is that it becomes extremely difficult to capture and share data within data science workflows, complicating collaboration among the working group’s data scientists, machine learning engineers, and data engineers. Additionally, machine learning models can only improve business processes if they have a sufficient amount of data pertaining to the challenges the firm is attempting to solve (Baum, 2021).
Figure 11.2
Data unifies today’s productive data science teams
3.3
Data Governance
Data silos complicate the process of defining who is authorized to access information and how each sort of user may utilize it. All data governance plans should aim to safeguard sensitive data when it is accessed, shared, and exchanged within and outside the company. Ideally, these procedures should be invisible to data scientists. To move data science projects from prototype to production without committing data privacy or security violations, it is necessary to establish a rigorous access policy and enforce consistent controls. This requires procuring a variety of security solutions that are costly to acquire and complex to deploy, integrate and maintain. By consolidating all data in a centralized repository and adding a coherent layer of data governance services, an organization can implement universal policies that expand access while minimizing risk (Baum, 2021).
3.4
Enormous Compute Capacity Requirement
Prototyping on a data science platform generally requires only a single-machine development environment with Jupyter Notebook, NumPy, and pandas. This strategy works well with small datasets, but scaling out to enormous datasets quickly exhausts the CPU and RAM of a single machine. It is also often necessary to accelerate model training using GPUs or multiple machines, which a single workstation generally cannot provide. In an on-premises environment, the resources available to support production-grade data science initiatives are always constrained (Bisong, 2019).
3.5
Challenges During Model Deployment in Production Environment
The next difficulty arises when we attempt to deploy our model (or application) to production. We must also guarantee that our application can support a significant number of concurrent users on a global scale. Production deployment frequently necessitates close collaboration among multiple teams, including data science, data engineering, application development, and DevOps. Moreover, if our application is successfully
deployed, we must monitor and respond to model performance and data-quality issues that may develop after the model is pushed to production (Bisong, 2019).
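As one hedged illustration of this kind of post-deployment monitoring, the sketch below compares the distribution of a single incoming feature against its training-time baseline and flags possible drift; the feature values, sample sizes and significance threshold are hypothetical.

import numpy as np
from scipy.stats import ks_2samp

# Stand-ins for the feature values seen at training time and the values arriving in production
baseline = np.random.normal(loc=50, scale=10, size=5000)
live = np.random.normal(loc=57, scale=10, size=1000)

# Two-sample Kolmogorov-Smirnov test: a small p-value suggests the live distribution has shifted
statistic, p_value = ks_2samp(baseline, live)
if p_value < 0.01:
    print(f"Possible data drift detected (KS statistic = {statistic:.3f}); consider retraining the model.")
else:
    print("No significant drift detected in this feature.")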
3.6
Specialized Skills Required to Operate Data Science Tools
By and large, data analysts and other users seeking to benefit from machine learning lack programming skills and a thorough grounding in mathematics and statistics. A data scientist or machine learning engineer may want to investigate raw datasets quickly. Similarly, the business intelligence team may wish to take a subset of data from the data warehouse, transform it, and query it with standard SQL clients in order to generate reports and show trends. In an on-premises setting, all of these routine operations are complex and require specialized skills such as programming, in-depth statistical or mathematical understanding, and familiarity with data integration or ingestion tools.
4.
BENEFITS OF CLOUD PLATFORM FOR DATA SCIENCE
4.1
Tracking All Your Data
Cloud computing can assist enterprises in successfully deploying a data science platform. Such deployments typically demand enormous amounts of data storage and processing power. The leading public cloud providers (AWS, Azure, Oracle and Google) make these resources affordable and plentiful, enabling data science teams to store nearly unlimited volumes of data at progressively lower cost and to process that data with powerful arrays of computers that can be scaled up and down at will. Some firms create data lakes to make sense of all of these different forms of data. Data lakes are an ideal foundation for data science and machine learning because they provide access to massive and diverse information, which supports training and deploying more accurate models. A cloud data platform is a specialized cloud service intended for storing, processing and sharing massive amounts of data for a variety of analytic tasks. For instance, a cloud data platform reduces the inconsistency that occurs when separate workgroups use distinct copies of the data (Fregly and Barth, 2021).
4.2
Easier to Collaborate
The cloud platform unifies several workloads, including data lakes, data warehouses, data engineering, data science, data analytics, and data applications, into a single collaborative platform. It supports all phases of the data science workflow, from data exploration to model building to production model deployment and business-ready analytics. Additionally, it provides support for a wide variety of prominent machine learning frameworks, tools and languages, including SQL, Java, Scala, Python and R. Scalable data processing is enabled by multi-cluster computing – for any number of concurrent users and workloads. A cloud platform that supports a broad range of libraries, notebooks, frameworks and tools significantly speeds up the creation of predictive models and data science applications. It will enable access to structured, semi-structured, and unstructured data from both internal and external sources via a data marketplace – all while maintaining consistent security and control (Baum, 2021).
4.3
Data Governance
By consolidating an organization’s data on a cloud platform, data governance activities can be streamlined. Granular data access controls, such as object tagging, row-level access, and data masking, are supported by the cloud platform vendor. The cloud service provider (CSP) provides multi-layer security solutions as a service on a public cloud platform; an organization only needs to enable and configure them in a few clicks. Additionally, the CSP is always responsible for infrastructure security, which relieves end users of many security-related configurations. Moreover, a cloud data platform integrates data security and governance processes by guaranteeing that all users have access to the same copy of the data (Baum, 2021).
4.4
Capacity on Demand While Reducing Cost
Cloud computing enables us to provision resources on demand, which allows firms to conduct rapid and frequent experiments. For instance, a data scientist may wish to evaluate a new library for performing data quality checks on a dataset or accelerate model training using the latest generation of GPU compute resources. They can quickly spin up hundreds, if not thousands, of servers to handle those duties, and they can deprovision such resources without risk if an experiment fails. Cloud computing enables a shift away from a fixed capital expenditure approach toward a pay-per-use model. Organizations pay for only the services they consume, eliminating the need for large upfront investments in technology that may become obsolete in a matter of months; they only pay for the period in which computational resources are used to perform data transformations or model training. They can save even more money. For example, by utilizing Amazon (AWS) EC2 Spot Instances for model training, enterprises can leverage spare EC2 capacity in the AWS cloud at a saving of up to 90 percent compared to on-demand instances, while Reserved Instances and Savings Plans enable financial savings by prepaying for a specified period of time. Cloud platforms also enable enterprises to scale resources up or down dynamically to meet the demands of their applications. For instance, if an enterprise deploys a data science application to production and its model is serving real-time predictions, it can use automatic scaling to add model hosting resources in the event of a spike in model requests, and scale those resources down automatically when the number of model queries decreases. There is no reason to overprovision resources to meet peak load requirements (Fregly and Barth, 2021). A hedged sketch of the spot-training pattern follows.
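The SageMaker Python SDK exposes managed spot training through flags on its Estimator; in the sketch below the container image, IAM role and S3 paths are placeholders, and parameter names may vary slightly between SDK versions.

from sagemaker.estimator import Estimator

# Hypothetical training job that uses spare (spot) capacity instead of on-demand instances
estimator = Estimator(
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/my-training-image:latest",  # placeholder
    role="arn:aws:iam::123456789012:role/MySageMakerRole",  # placeholder
    instance_count=1,
    instance_type="ml.m5.xlarge",
    use_spot_instances=True,  # request discounted spare capacity
    max_run=3600,   # cap on actual training time, in seconds
    max_wait=7200,  # cap on training time plus time spent waiting for spot capacity
    output_path="s3://my-bucket/model-artifacts/",  # placeholder
)
estimator.fit({"train": "s3://my-bucket/training-data/"})  # placeholder S3 channel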
4.5
Easier to Roll Out to Production
Additionally, one of the advantages of cloud-based data science development is the seamless transition from prototype to production. Within minutes, we can transition from running model prototyping code in our notebook to performing data quality checks or performing distributed model training across petabytes of data. And after that is complete, we can use our trained models to provide real-time or batch forecasts to millions of people worldwide. By developing data science projects on the cloud, organizations can swiftly transition their models from prototype to production without having to establish their own physical IT infrastructure. AWS, Google, Microsoft and Oracle public clouds give them the tools necessary
to automate operations and deploy models into a highly scalable and performant production environment (Fregly and Barth, 2021).
4.6
No Specialized Skills Required to Maintain the Cloud Platform Compared to On-Premises
A cloud data platform significantly reduces the amount of code that separates users from their data. A user working on a data science project can simply access a cloud service that determines the optimal method for the dataset, trains and tunes the model, and deploys the model to production with a single click. Because data science initiatives require access to structured, semi-structured, and some forms of unstructured data, organizations can leverage a cloud data platform to connect their data lake and data warehouse. This significantly accelerates the building of machine learning models by making all types of data easier to access and by helping to harmonize all analytics efforts. For instance, the outcomes of data science experiments can be included in the platform and made accessible via reports and dashboards (Fregly and Barth, 2021). Data analysts and other users who wish to profit from machine learning but lack advanced programming skills or a thorough understanding of mathematics and statistics can use AutoML products such as Amazon SageMaker Autopilot, Oracle’s Autonomous Data Warehouse, and Google Cloud’s Vertex AI. These technologies simplify the process of selecting algorithms, training models, and finally selecting the optimal model for the business challenge at hand (a hedged sketch of this pattern appears after the list below). Businesses can readily upgrade their data science ecosystems with ready-to-use AI services, whether their business demands bringing machine learning to the edge or is just getting started with AI/ML. By definition, all cloud platforms delegate infrastructure management to the cloud provider. Several critical characteristics of these infrastructure capabilities include the following:
● near-unlimited performance and scalability, based on a pay-as-you-go mechanism;
● near-zero maintenance: as the cloud provider manages all resources, there is no need to perform software updates, database tuning, or a variety of other administrative activities;
● a global presence with centralized security measures that adhere to unique requirements for data localization and sovereignty (Fregly and Barth, 2021).
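As a hedged sketch of the AutoML pattern mentioned above, the snippet below starts a SageMaker Autopilot job through the SageMaker Python SDK; the IAM role, bucket and target column are placeholders, and class or argument names may differ across SDK versions.

import sagemaker
from sagemaker.automl.automl import AutoML

session = sagemaker.Session()
automl = AutoML(
    role="arn:aws:iam::123456789012:role/MySageMakerRole",  # placeholder IAM role
    target_attribute_name="churned",  # hypothetical column the model should predict
    max_candidates=10,  # limit on candidate pipelines Autopilot explores
    sagemaker_session=session,
)

# Autopilot handles feature engineering, algorithm selection and tuning for the supplied CSV
automl.fit(inputs="s3://my-bucket/customer-data/train.csv", wait=False)  # placeholder S3 path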
5.
CASE STUDIES ON ADOPTING CLOUD PLATFORM FOR DATA SCIENCE IN THE FINANCIAL INDUSTRY
AI and machine learning have been utilized to continuously improve compliance, surveillance, and fraud detection in the financial sector. They contribute to the acceleration of document processing, the creation of individualized pricing and financial product recommendations, and the assistance of traders in making trading decisions. Machine learning enables organizations in a variety of industries to address some of their most pressing difficulties. Let’s examine many industry-specific business sectors and use cases in which a company can leverage data science to solve problems while putting the entire ecosystem in the Public Cloud.
The following are some of the areas and use cases in which machine learning can be employed in the financial industry.
5.1
Machine Learning Aids in the Detection of Fraud, Anti-money Laundering Efforts, and Exposure to Credit Risk
Let us look at a real-world example. FWD currently uses a range of Google Cloud solutions. For example, Cloud Vision and AutoML power FWD’s Know Your Customer (KYC) identity verification. When a customer purchases insurance online, it is critical that FWD conducts quick, accurate background checks so that policies can be delivered in a timely manner; all of its services are delivered quickly as part of its customer experience and value proposition. FWD deployed Cloud Vision AI and Cloud AutoML to rapidly assess the validity of an ID submitted by a customer, scanning data fields ranging from date of birth to location. This AI-powered technology has increased operational efficiency by 20 percent while reducing ID verification costs by 50 percent; in a nutshell, it has made a world of difference (FWD Case Study / Google Cloud, n.d.).
5.2
Machine Learning Enables the Generation of Automatic Compliance Reports, Solutions for Stress Testing and the Behavioral Analysis of Email Messages to Identify Questionable Employee Activity
Let us look at another real-world example. HSBC’s partnership with Google on cloud technology focuses primarily on machine learning and data capabilities, driven by data security needs and the demand for faster, more accurate results. HSBC has a team of data scientists, engineers, and architects working on transferring its data and analytics operations to Google Cloud and leveraging AI and machine learning to use the data. This team collaborates with HSBC departments and business units to design and build data and AI solutions. For customer sentiment analysis, the team used AutoML Natural Language and Speech-to-Text from Google Cloud. With Google’s AI-powered Speech-to-Text technology, they accurately transcribed spoken combinations of Cantonese and English. Moreover, they could run a machine learning model in an hour instead of the week it took on their on-premises data science platform, reducing implementation time (HSBC Case Study / Google Cloud, n.d.).
5.3
Using Unsupervised Learning Algorithms and Methodologies, Banks may Segment their Consumers and Create Customized, Targeted Product Offerings
Let us look at another real-world example. Forth Smart Thailand’s well-known kiosks began as a way for consumers to spend cash and coins to top up prepaid mobile phones and transfer money between friends and family, and they now provide a much wider range of services and ebanking functionalities. With over 15 million users, the kiosks have become excellent real estate for advertising and internet package offerings. Forth Smart’s kiosk network, which processes over 2 million transactions every day, requires real-time visibility into client behavior. Forth Smart leverages Oracle Autonomous Data Warehouse, which runs on Oracle Cloud Infrastructure and doesn’t require a database administrator, to gain such insight while also safeguarding the data. Forth Smart’s business analysts employ Oracle Analytics to apply
machine-learning algorithms to identify client categories and estimate how well an offer will perform, resulting in a two-fold increase in ad conversion rate (Forth Smart Connects Rural Economies Using Oracle Cloud, n.d.).
5.4
Machine Learning Enables Banks to Operate at a Faster Pace and with Greater Agility in order to Compete with other Banking and Financial Organizations and to Leverage Big Data
Let us look at another real-world example. Federal Bank India launched its AI personal assistant in 2020, which has become the trusted mascot for all of the bank’s marketing and social media; it was built on Google Cloud using Dialogflow. The bank had previously tried around 30 conversational AI companies and encountered the same issues; handling basic questions with a limited phrasing range was one big problem. As a result, these chatbots were difficult to align with Federal Bank’s objective of intelligent client contact, and training the bots took a long time, distracting their engineering staff from problem solving. The traditional platforms they tested had no automatic learning, so they had to train the bot for every inquiry, training a thousand questions per month and fine-tuning queries for clients. So they used Dialogflow in Google Cloud and the platform trained itself. The contextual auto-learning capabilities of Dialogflow save their development team up to five hours a day in routine tasks, time that can be devoted to designing new AI solutions. Compared to an earlier prototype solution from a different source, this digital assistant now handles 1.4 million requests each year. By 2025, the bank expects to treble its query volume. They estimate that the bot has increased overall customer satisfaction by 25 percent, and by 2025 they anticipate that their bot Feddy will handle up to 4 million transactions, with a 50 percent reduction in customer service costs (Federal Bank Case Study / Google Cloud, n.d.). In another example, Elula is an Australia and New Zealand-based startup that offers artificial intelligence as a service (AIaaS) to financial services institutions (FSIs). Elula solutions are being used by FSIs to make predictions and business decisions, as well as to speed up the deployment of artificial intelligence. Banks have the financial means to invest millions of dollars in building their own AI solutions; Elula, on the other hand, offers a business-ready AI service that enables banks to avoid these costs, deliver results quickly, and concentrate on their core competencies. Elula’s AIaaS capabilities are based on AWS services such as AWS Glue, an extract, transform, and load (ETL) service; Amazon EMR (a cloud-based big data platform), which automates engineering for millions of rows of data; and Amazon Athena (a serverless interactive query service), which completes data cleansing. By collaborating with AWS, Elula ensures that FSIs can protect their data assets with enterprise-grade security and privacy rules. Josh Shipman, co-founder of Elula, states, “Security is my background, and it has always been the most crucial problem.” Elula has a number of security certifications, including ISO 27001, which recognizes the company’s ongoing commitment to data security. Elula’s FSI customers now have access to one of Australia’s largest big data platforms, replete with a data science model generation pipeline. Customers of Elula think that startups like Elula are more nimble than established partners in meeting their business needs. Most
importantly, Elula’s AI offering provides meaningful financial returns 12–18 months faster than businesses could build an AI solution in-house (Elula Case Study, n.d.).
5.5
Machine Learning Reliably Predicts the Chance of Credit Default using both Quantitative and Qualitative Data
Let us look at another real-world example. ICICI Prudential’s use of the Google Cloud deep learning platforms Recognic and Vision API minimizes the initial waiting period for applicants, which can otherwise result in drop-offs. Applicants quickly learn whether they need to update or submit more documents, and if the application data match the documentation, no further checks by an underwriter are necessary. “Google Cloud has reduced middle and back-office work, allowing us to handle 30% more applications in the same time frame without adding resources.” ICICI Prudential Life Insurance recognizes the importance of protecting customer data and is committed to ensuring its security and privacy. Google Cloud automatically deletes contact details after processing; this workflow step ensures that no data is retained during the optical character recognition process (ICICI Prudential Life Insurance Case Study / Google Cloud, n.d.).
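The document-extraction step described in this example can be approximated with Google’s Cloud Vision OCR client; the sketch below is illustrative, and the file name, credentials and project configuration are assumed to be in place.

from google.cloud import vision

# Run optical character recognition over a scanned application document
client = vision.ImageAnnotatorClient()
with open("scanned_application.jpg", "rb") as f:  # hypothetical scanned document
    image = vision.Image(content=f.read())

response = client.document_text_detection(image=image)
# The extracted text can then be matched against the data entered in the application form
print(response.full_text_annotation.text)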
6.
CONCLUSION
A thriving data science ecosystem that supports an organization’s business needs requires on-demand access to high-performing, scalable compute, storage and network connectivity while ensuring data governance. It also requires tools and technologies that let the individuals working on a data science project collaborate easily, together with readily available solutions spanning data preparation, model development, continuous improvement and deployment to production. Procuring these applications, integrating them, and constantly maintaining or upgrading such an ecosystem on an on-premises platform is extremely costly and highly complex. An established public cloud service provider makes all of these required services available on a pay-per-use basis when needed; they are easy to deploy and integrate, high performing and scalable. This removes the complexity of manageability from data science users, allowing them to focus on business improvements that leverage these solutions. Moreover, these solutions are also available as private cloud offerings for organizations whose data cannot be stored in the public cloud. Organizations should evaluate these popular public cloud offerings before planning to create an ecosystem on their own, which may otherwise lead to project failures and monetary loss due to the issues discussed in this chapter.
REFERENCES
Baum, D. (2021). Cloud Data Science For Dummies®, Snowflake Special Edition. John Wiley & Sons.
Bisong, E. (2019). Building Machine Learning and Deep Learning Models on Google Cloud Platform: A Comprehensive Guide for Beginners. Apress.
Fregly, C. and Barth, A. (2021). Data Science on AWS: Implementing End-to-End, Continuous AI and Machine Learning Pipelines. O’Reilly Media.
Case Studies
Build Machine Learning Solutions with Oracle’s Services and Tools (n.d.). Retrieved 25 March 2022 from https://www.oracle.com/a/ocom/docs/build-machine-learning-solutions-cloud-essentials.pdf.
Canopy Case Study – Amazon Web Services (AWS) (n.d.). Amazon Web Services, Inc.; aws.amazon.com. Retrieved 25 March 2022 from https://aws.amazon.com/solutions/case-studies/canopy-case-study/.
Elula Case Study (n.d.). Amazon Web Services, Inc.; aws.amazon.com. Retrieved 25 March 2022 from https://aws.amazon.com/solutions/case-studies/elula/?did=cr_card&trk=cr_card.
Federal Bank Case Study | Google Cloud (n.d.). Google Cloud; cloud.google.com. Retrieved 25 March 2022 from https://cloud.google.com/customers/federal-bank/.
Forth Smart Connects Rural Economies Using Oracle Cloud (n.d.). Oracle; www.oracle.com. Retrieved 25 March 2022 from https://www.oracle.com/customers/infrastructure/forth-smart/.
FWD Case Study | Google Cloud (n.d.). Google Cloud; cloud.google.com. Retrieved 25 March 2022 from https://cloud.google.com/customers/fwd/.
HSBC Case Study | Google Cloud (n.d.). Google Cloud; cloud.google.com. Retrieved 25 March 2022 from https://cloud.google.com/customers/hsbc.
ICICI Prudential Life Insurance Case Study | Google Cloud (n.d.). Google Cloud; cloud.google.com. Retrieved 25 March 2022 from https://cloud.google.com/customers/icici/.
Moneytree KK Case Study | AWS Marketplace | AWS (n.d.). Amazon Web Services, Inc.; aws.amazon.com. Retrieved 25 March 2022 from https://aws.amazon.com/solutions/case-studies/Moneytree-AWSMarketplace/?did=cr_card&trk=cr_card.
nib Group Case Study (n.d.). Amazon Web Services, Inc.; aws.amazon.com. Retrieved 25 March 2022 from https://aws.amazon.com/solutions/case-studies/nibgroup/?did=cr_card&trk=cr_card.
OakNorth Case Study (n.d.). Amazon Web Services, Inc.; aws.amazon.com. Retrieved 25 March 2022 from https://aws.amazon.com/solutions/case-studies/oaknorth-case-study/?did=cr_card&trk=cr_card.
Pepperstone Case Study – Amazon Web Services (AWS) (n.d.). Amazon Web Services, Inc.; aws.amazon.com. Retrieved 25 March 2022 from https://aws.amazon.com/solutions/case-studies/pepperstone-case-study/?did=cr_card&trk=cr_card.
12. How HR analytics can leverage big data to minimise employees’ exploitation and promote their welfare for sustainable competitive advantage
Kumar Biswas, Sneh Bhardwaj and Sawlat Zaman
1.
INTRODUCTION
Almost 42 years ago, Jac Fitz-Enz introduced a radical notion of “measuring” the human resource (HR) that challenged the long-established philosophy that “HR activities cannot be measured” (cited in Caudron, 2004). This notion created much controversy and debate. Since then, a lot has changed, and it is now common for organisations to determine the expenses per employee in recruitment, turnover, and training. Organisations, in so doing, keep track of return on human capital and make sense of the value and effectiveness of their HR initiatives. Over this period, the tools used for measuring costs and ensuing returns have also transformed, in line with technological breakthroughs, from manual calculations to software- and data-driven human resource management analytics (HRA). HR metrics and different levels of analytics in human resource management (HRM) have been in use for a while now (Boselie, 2014; Dahlbom et al., 2020). The applications of big data analytics in organisation management, particularly HRM, have been receiving growing attention from both scholars and practitioners (Dahlbom et al., 2020; Akter and Wamba, 2016). Concurrently, HRA applications have increasingly been helping managers and HR professionals to modernise people management systems. Despite the abundance of benefits of using HRA to manage people more objectively than ever before, scholars and practitioners have been raising concerns about the dark side of the data-driven people management approach (Tambe et al., 2019; Dahlbom et al., 2020). As part of that bigger picture, we look at how HRA can leverage big data to minimise employee exploitation and promote welfare to sustain competitive advantage. In this chapter, we first articulate concepts related to HR analytics and big data, followed by discussions on how big-data-driven HR analytics can be (mis)used both for employee exploitation and employee welfare. This is followed by discussions on the key challenges in adopting big data-led HR analytics and a conclusion with future research directions.
2.
CONCEPTUAL ARTICULATION OF HR ANALYTICS AND BIG DATA
2.1
HR Analytics
Human Resource Analytics (HRA), also known as people analytics, workforce analytics, talent analytics, and human capital analytics, gathers and analyses data about the people working in an organisation (CIPD, 2021). HRA is generally defined as “the application of sophisticated data mining and business analytics techniques to the field of HR” (Vihari and Rao, 2013, p. 1). Marler and Boudreau (2017) articulated HRA as “A HR practice enabled by information technology that uses descriptive, visual and statistical analyses of data related to HR processes, human capital, organisational performance and external economic benchmarks to establish business impact and to enable data-driven decision-making” (p. 15). HRA uses data from existing HR systems, sales, IT, payroll, and salary surveys (internal and external). It is worth bearing in mind that HRA as a concept is relatively new and is about statistical techniques and experimental approaches to measure the impact of HR activities (Lawler et al., 2004). HRA, therefore, is not to be confused with “HR metrics”, which measure efficiency, effectiveness or impact (Boselie, 2014). HRA offers a more refined data analysis by integrating external and internal sources and HR data. HRA can be used for diverse purposes such as recruiting, hiring, team building, evaluating performance, engaging and retaining employees, training, and associated organisational outcomes (Collins, 2013; Davenport et al., 2010; Ben-Gal, 2019). Furthermore, HRA involves using sophisticated information technology to collect, influence, modify and share reports to aid people management-related decision-making, thus placing HR in a more strategic role at the organisational level (Fitz-Enz, 2009; Kutik, 2014). Dahlbom et al. (2020) reveal that while some organisations have been advancing analytical capabilities by acquiring more of the expertise and tools needed, many others are still at the initial stage of systematising the fundamental HR processes and modernising their HR information systems. It is also commonly understood that there is a variance between organisations’ objectives and capabilities to deploy HR analytics. Organisations have been reaping benefits from employing big data-driven HRA by replacing the traditional, intuition-based operating mode of HR systems (Minbaeva, 2017; Dahlbom et al., 2020). By adopting advanced, sophisticated technology, HRA can now generate evidence-based suggestions to facilitate authentic data-driven HR decisions, thus reducing subjectivity in HR-related decision-making. Some critics argue HRA is merely a “fad” (Angrave et al., 2016; Marler and Boudreau, 2017). The HR function still struggles when it comes to quantifying more complex aspects of work–life balance or differentiating between individual and team performance systematically (Scullen et al., 2000). Moreover, scholarly concern is growing about the intention behind using HRA-generated decisions, as there are many instances where HRA has been used to exploit employees rather than promote their welfare (Khan and Tang, 2016).
2.2
Big Data
Big data analytics is defined in varied ways, but the overall concept paints a clear picture of what the term stands for. According to IBM (2021, p. 1), “Big data analytics is the use of advanced analytic techniques against extensive, diverse big data sets that include
structured, semi-structured and unstructured data, from different sources, and in different sizes from terabytes to zettabytes”. Big data is characterised by high volume, high velocity and wide variety, with social media, web pages and the Internet of Things among its biggest sources (Akter et al., 2016). Big data allows the use of highly advanced mathematical models that make it possible to recognise associations between previously undetected phenomena that might exist unexplored. In the consumer market space, by leveraging big data on consumer shopping habits, lifestyles, and spending patterns, businesses can make solid predictions as to “what products“ consumers want, “when“ they want them and at “what“ quality and price (Akter and Wamba, 2016; Ngai et al., 2017). It is anticipated that the HR decision-making process will likely undergo a revolution using big data (Mayer-Schönberger and Cukier, 2014; Davenport, 2018). Human Resource Analytics (HRA) needs to collect and process a sophisticated amount of big data (that is, data characterised by volume, velocity, variety and value) to systematically perform all the HRM tasks from internal and external sources (McAfee and Brynjolfsson, 2012). Companies using big data-driven business decision-making are now leading the way, such as Amazon, American Express, BDO, Capital One, General Electric, T-Mobile, Starbucks, and Netflix (O’Neill, 2016). Specific theories on big data or HRA itself are still evolving. Since big data and HRA are essentially linked with strategic decision-making, some academics argue that they can be studied through the lens of theories such as decision-making models (Mazzei and Noble, 2019) and the resource-based view (RBV) (Barney, 1991). Scholars (for example, Akter et al., 2020; Mikalef et al., 2019) have been heavily promoting the use of “Dynamic Capabilities theory” (Teece et al., 1997) in the big data analytics space. Dynamic capabilities (DCs) are articulated as a “firm’s ability to integrate, build, and reconfigure internal and external competencies to address rapidly changing environments”. DCs thus reflect “an organisation’s ability to achieve new and innovative forms of the competitive advantage given path dependencies and market positions” (Teece et al., 1997, p. 516). Extending the ideas of resource-based theory (Barney, 1991), the dynamic capabilities framework enables businesses to determine the best course of action in the face of rapidly changing environments (Teece et al., 1997). In the competitive marketplace, a firm’s success relies on how effectively and efficiently it can manage its non-imitable and non-replaceable human resources contributing to product innovations and overall firm performance (Barrales-Molina et al., 2015; Wamba et al., 2017). This amalgamation is needed to understand better how big data-driven HR analytics are shaping strategic decision-making (Mazzei and Noble, 2019) and overall management practices. Through big data and HRA, the ever-struggling relationship between HR and corporate strategists can finally meet at a point at which to deliver expected results (Soundarajan and Singh, 2016; Hajkowicz, 2015). 
We posit that the DCs framework underpins firms’ abilities to renew internal and external competencies and thereby develop a set of sustainable competitive advantages, to the extent that organisations are able and willing to utilise big data and HRA in tandem. Dahlbom et al. (2020) argue that HRA can offer insight for business-driven decision-making if approached with an open mind and an appropriate set of analytical tools. Others, such as Tambe et al. (2019), have raised concerns over the ethical and legal aspects of using HRA to manage people. On many occasions, biased algorithms used in AI and machine learning can lead to employee exploitation rather than welfare, warranting that HRA be scrutinised closely.
3.
LITERATURE SEARCH TECHNIQUES
We have explored an under-researched area of study and hence adopted a qualitative approach in our literature search (Creswell, 2008; Saunders and Townsend, 2016). The evidence offered in this study is collected from published sources from both academic journals and other secondary sources (Bell et al., 2018). These articles and other sources were identified using keywords such as “big data analytics”, “Human Resource Management analytics” and “HR analytics” using Elsevier, Google Scholar, Springer, and Sage databases. Our discussions were informed through content analysis (Hsieh and Shannon, 2005; Elo et al., 2014; Erlingsson and Brysiewic, 2017) and thematic analysis (Eriksson and Kovalainen, 2015). Based on our literature search, we present how big-data-driven HR analytics can be (mis)used for employee exploitation, followed by a section on how HRA can be used to ensure the welfare of employees.
4.
HOW BIG-DATA-DRIVEN HR ANALYTICS CAN BE USED IN THE AREA OF EMPLOYEE EXPLOITATION
In the wake of a constantly changing external environment such as the COVID-19 pandemic, the focus of the HR function is becoming more strategic than ever before. This changing situation requires the HR department to develop people with the right competencies to tackle unforeseen challenges in the future. To play a strategic role, HR practices have been leveraging HRA to make solid predictions and prescriptions for the organisation to remain competitive. Algorithm- and data-driven HRA enables HR practitioners to make objective decisions by reducing the likelihood of subjective biases in HR decisions (Tambe et al., 2019). With an abundance of structured and unstructured private and public data, the revolution in information technology has contributed to the development and adoption of HRA systems (Davenport et al., 2010; Kutik, 2014). HRA helps improve organisational performance through HR planning, predicting skill demand, identifying attrition and its causes, and evaluating job performance (Ben-Gal, 2019; Collins, 2013). For example, IBM uses HR algorithms to identify the most appropriate training programmes required for a particular employee based on the experiences and performance of other employees in similar roles (IBM, 2021). IBM’s Smarter Workforce Institute reveals that its artificial intelligence-enabled HRA system has been helping the HR department solve business challenges. HRA is enabling IBM to attract employees, develop new skills, and improve employee experiences, thus making more efficient use of HR budgets to remain competitive and innovative. At the same time, ethical concern is mounting about how HRA handles big data, which is predominantly sourced from people’s personal data, and whether the big data fed into HRA systems is used for employee exploitation or welfare (Lawler et al., 2004). Furthermore, it is commonly believed that technology-enabled HR systems can perform many traditional HR tasks efficiently with less influence from human biases. However, research indicates that HRA can be used to manipulate HR practices in ways that disadvantage particular groups such as ethnic minorities and females (Tambe et al., 2019; Lengnick-Hall et al., 2018). The following section provides a discussion on how big-data-driven HRA can be used for employee exploitation over welfare in the key functional areas of HR practices.
4.1
Exploitation of HRA in the Process of Recruitment and Selection
To realise an organisation’s strategic goals, HR managers need to hire the right people for the right job at the right time. Inappropriate hiring can cost organisations in several ways; for example, if the right people are not hired for the right jobs, organisations can lag behind their competitors in innovating new products. To hire the most suitable candidates, machine learning (ML)- and artificial intelligence (AI)-enabled HRA has come into play to handle recruitment activities more efficiently, such as CV screening, reference checking and initial shortlisting of job-seekers for progression to the next phase (Johansson and Herranen, 2019). HRA tools such as Manatal, Monday.com, Breezy HR, Arya, HireVue, and Plum are increasingly being used by small to large organisations to screen applicants and complete the costly hiring process with little or sometimes no human involvement (Boudreau, 2017). It seems that HRA can tackle the entire recruitment process more efficiently without human intervention; however, given that the HRA system relies heavily on the internal and external data fed into it by humans, forms of human bias can be injected into the system through manipulation of the code and dataset. For example, in 2018, a machine learning specialist at Amazon.com discovered that since 2014, Amazon’s HRA system had been systematically discriminating against females while hiring software developers. Amazon’s investigative team later revealed that Amazon’s HRA screened job applications based on patterns in résumés submitted to the system over the previous ten years, the majority of which came from males due to the historically male-dominated tech industry. Using this historical training data, Amazon’s HRA-driven CV screening process automatically excluded résumés containing words such as “women”, “women’s chess club captain” and “women’s college” (Reuters, 2018). Kang et al. (2016) found that the résumé screening process in many US organisations preferred “White names” over “Non-white names”, which discriminated against minorities. They further revealed that so-called pro-diversity employers, such as “equal opportunity employers” or those stating “minorities are strongly encouraged to apply”, tend to discriminate more against their ethnic minority applicants. Though many ethnic minorities felt it was safer to disclose their racial identity on their résumés to these so-called pro-diversity employers, these employers were found to be more unfavourable towards ethnically diverse job-seekers. It was evident that recruitment bias began at the job analysis stage, when the need for a role in the organisation is identified, and then permeated the hiring process. Such tendencies can cascade down to job advertising by attaching subtle disparate expectations to the job description and selection criteria, which can dissuade many suitable talents from applying for the posted job. In some instances, employers use HRA with prior KSAOs (knowledge, skills, abilities, and other traits) and person–environment fit to predict candidates’ future onboarding and retention costs. Using applicant tracking systems (ATSs) applied to submitted résumés, many employers discount potential applicants’ ability to fit into the organisational team and culture (Mujtaba and Mahapatra, 2019). 
Such biased predictions result in invalid rankings, thus excluding many potential talents from being invited to the next selection phase (Snell, 2006). A study conducted by Northeastern University and USC scholars explored whether AI-enabled advertisements posted on Facebook are bias-free. They revealed that supermarket cashier roles were shown to an audience that was 85 per cent female, whereas taxi-driving ads were shown to an audience that was approximately 75 per cent African American (Bogen, 2019). All the examples and research findings explain how HRA can be exploited to disadvantage one particular
group over another. The following section discusses how HRA can be exploited in training, development, and career progression.
4.2
Training, Development and Career Progression
Today, more organisations than ever before rely on HRA to quantify, monitor and engage their employees, leading to higher individual and organisational performance (Giermindl et al., 2022). For example, to better engage employees, Microsoft and IBM have been using Office 365 Workplace Analytics and Watson Talent Insights, respectively, while SAP leverages SuccessFactors People Analytics (SFPA). The main aim of HR development is to identify current gaps in training and knowledge in order to design appropriate training and developmental programmes enhancing individual and organisational performance (Maity, 2019). Furthermore, HRA helps organisations forecast future human resource needs by performing a systematic gap analysis (Rex et al., 2020). There has been a significant push for organisations to develop talent internally in order to design succession planning and meet future skill shortages. For example, Goldman Sachs has its own talent mentoring programme to build and retain talent for future leadership roles (Kaplan et al., 2018). However, the benefits of HRA cannot be fully realised if HR systems are fed with biased data and run by manipulative algorithms that favour one particular group over others. For example, Prassl (2018) found that Uber’s performance analytics favoured male drivers over females; consequently, female drivers received less pay for being slower drivers on the same routes as their male counterparts. Research indicates that, though seemingly invisible, disparate discrimination can disadvantage females relative to males in participating in executive development programmes (Köchling and Wehner, 2020; Raub, 2018). For example, if ten years of continuous full-time employment is programmed into HRA as a condition for participating in an organisational development programme, then HRA, without being biased towards any group, will select only those people who satisfy this continuous employment requirement. In most cases, it is evident that continuous employment is unlikely to be satisfied by working mothers or women who have children and caring responsibilities. Such disparate treatment directly impacts female employees’ motivation, work commitment and job performance, and affects their participation in the management team (Yam and Skorburg, 2021), arguably contributing to a gender-imbalanced senior management team. The above empirical evidence demonstrates how HRA can be exploited to deprive a targeted group of career advancement. The following section provides an account of how HRA can be exploited in the performance appraisal system.
4.3
Maintenance and Performance Appraisal
More organisations have been using HRA to assess employee performance effectively, determine appropriate rewards, and manage career development and succession planning (Shet et al., 2021). Research indicates that a fair and transparent performance appraisal is a precondition for ensuring distributive and procedural justice, leading organisations to a satisfied workforce and accelerated job performance (Shet et al., 2019; Aydiner et al., 2019). Historically, the HR department has relied on bell curve performance appraisal processes, which contribute to subjective biases such as leniency, strictness, and central tendency (Doellgast and Marsden, 2019). HRA empowers supervisors to make data-driven performance appraisals that address these reported shortcomings while enabling employees to self-monitor their
own performance against the organisational standard (Waters et al., 2018). One of the key objectives of the performance appraisal process is to provide employees with developmental opportunities through self-appraisal and reflections. With HRA, employees can also access their real-time performance data, rewards, recognition and career trajectory (Hamilton and Sodeman, 2020). Such real-time self-appraisal can empower employees on the job to enhance job satisfaction and retention (Shet et al., 2021). Data-driven HRA can be a great supplement to advance mutual consultations between employees and supervisors on designing appropriate training and development programmes, job rotations and mentoring programmes (Andersen, 2017). Despite a win–win use of HRA, there are many instances where HRA can be used to exploit employees instead of enhancing their well-being. For example, an investigative report released by the Australian Broadcasting Corporation (ABC) on 27 February 2019 revealed that Amazon Australia’s warehouse staff have been constantly timed and monitored, as with their US warehouse operations. In Australian operations, employees are required to pack products within the time frame and instructions generated by algorithms. Many of Amazon’s Australia employees complained that “the workplace is built around a culture of fear where their performance is timed to the second”. They added that: high-pressure targets make them feel like they can’t go to the toilet and sometimes push them to cut safety corners; they can be sent home early without being paid for the rest of their shift when orders are completed; and everyone is employed as a casual and constantly anxious about whether they’ll get another shift. (ABC, 2019, p. 1).
All the above real examples and cases provide an account of how HRA can be used to manipulate the performance appraisal system to exploit employees; many scholars termed this exploitation “modern-day slavery”. The following section discusses how HR analytics can be used to facilitate welfare over exploitation in the workplace.
5.
HOW HR ANALYTICS CAN BE USED FOR WELL-BEING OVER EXPLOITATION
Significant progress has been made in HR practices over the last three decades; however, critics argue that the search for an association between HRM practices and employee performance has been pursued at the expense of employee well-being (Guest, 2017). The advent of disruptive technologies such as artificial intelligence, machine learning, blockchain technologies, data mining, and the Internet of Things (IoT) has accelerated data-driven HR decision-making. Consequently, HRA can be used to facilitate data-driven and well-being oriented HR decision-making (Huselid, 2018; Davenport, 2018; Guest, 2017; Boudreau and Cascio, 2017). If used for promoting employee well-being, HRA can offer many novel solutions, for instance by generating real-time employee engagement data. Such real-time engagement data generate authentic suggestions for HR managers when determining commensurate compensation and career advancement decisions to enhance job satisfaction (Bakhru and Sharma, 2022). HRA relies on comprehensive people-related data to make better predictions about people and their capabilities. HRA helps managers identify skill gaps in people to design the right training programmes for uplifting skills and to better prepare them for the future (Guest, 2017). For example, IBM’s employees are provided with an opportunity to advance
their careers based on the recommendations provided by IBM’s HR analytics system. IBM’s Blue Match software generates career advancement recommendations based on employees’ interests, experiences, prior training and performance track record. It was found that 27 per cent of IBM’s employees changed their jobs in 2018 based on the recommendations of IBM’s HR analytics (Tambe et al., 2019). This business case of IBM provides one of the many successful examples of how HRA can be used ethically to enhance employee welfare as a win–win approach for the organisation and position the organisation as an employer of choice to existing and potential employees (IBM, 2021). Based on an extensive review of the literature, we posit that the following steps accelerate employee welfare using HRA.
5.1
Create Analytics Supportive of Inclusive Organisational Culture
Adoption of HRA can significantly reduce the likelihood of subjective interpretations in human decisions, thus putting inclusiveness and diversity at the forefront of managing people in organisations. For example, Google’s Oxygen Project has been trying to answer the question of what makes a good manager (Bock, 2015). This project was launched back in 2008 to train future leaders to set up the best HR practices to drive innovation and employee well-being. Over the years, Google’s HRA-driven suggestions have helped managers improve on many key HR metrics, such as employee turnover, performance and job satisfaction, with a track record of success. HRA can facilitate transparent decisions that promote meritocracy over favouritism through inclusiveness and diversity (Kryscynski et al., 2018). For example, McKinsey’s study (2020) of 15 countries and over 1000 firms documents that organisations in the top quartile for ethnically diverse teams outperformed their counterparts at the bottom by 25–36 per cent. They further posit that leaders who accommodate diverse talents and consider multiple perspectives are more likely to enable their organisations to overcome the COVID-19 crisis better than counterparts who are slower in implementing inclusive practices (Buengeler et al., 2018). However, the adoption and implementation of HRA throughout the organisation is unlikely to bring about any positive organisational outcomes if the top management, such as the board and CEO, fail to understand the strategic roles of HRA. Deloitte’s 2017 study on High-Impact People Analytics revealed that 69 per cent of their sampled organisations with 10 000 or more employees had leveraged HRA to make people management-related decisions. However, there is a criticism that 20 per cent of HR teams still lack the data analytic skills to produce useful reports. This shows how analytics support of an inclusive organisational culture at the top is crucial to instilling fairness and transparency in HR decisions to accelerate innovation and employee welfare.
5.2
Collect Data to Run HR Analytics Ethically
Due to the complexity and volume of big data in HR, pulling information from multiple datasets can be a daunting, time-consuming task. Also, the quality of data analysis depends on the quality of collected data; without good input, the output will be unreliable (Rasmussen and Ulrich, 2015). It is important to note here that data analyses need to be used to extract meaningful insights to drive well-informed decisions (Andersen, 2017). Of particular concern are the questions raised about HR professionals’ ability to interpret data trends effectively and derive meaningful insights from them (Angrave et al., 2016). Hence, effective
data governance is necessary to guarantee the quality of the data extracted and examined from massive datasets (Hashem et al., 2015). Despite an abundance of big data, getting systematically right data for HRA to generate unbiased suggestions poses a significant challenge. Andersen (2017) identifies several key challenges in collecting and synchronising reliable and valid data through HRA. First, data access constitutes a considerable challenge because many HR processes cannot be managed with the aid of technology; thus, substantial effort is required in collating data (Barrett et al., 2015). For instance, there may be a need to collect employees’ personal data from informal channels such as social media posts, email messaging, and interactions with customers, colleagues and supervisors. This can help HRA map out employees’ performance or predict the likelihood of their leaving the organisation (Sparrow et al., 2015). Outsourcing private data collection would be less time-consuming and more cost-effective; however, such a practice could compromise privacy and data integrity (Minbaeva, 2017). On ethical grounds, some personal data are so sensitive that authorisation is required as to who may view this information. For instance, information about salary pay-outs, increments and performance evaluations is considered highly confidential, and even data analysts will need to be very careful when using these data (Guru et al., 2021). Second, amidst the availability of multiple big data analytics tools, HR managers need to decide which tools will be best suited to resolve the organisation’s HR issues. This decision concerning data collection tools and consequential data accuracy is of utmost importance since many crucial HR-related decisions will be based on it. Third, wrongly entered data and duplication are issues that can generate invalid results (Minbaeva, 2017). Fourth, decentralised data collection can make it challenging to integrate diversely collected datasets, with some of the discrepancies occurring because of metrics and timeframes of a longitudinal nature (Minbaeva, 2017). The positioning of HR departments as decentralised units within organisational hierarchies compounds these challenges even more due to data coordination issues (Angrave et al., 2016). Finally, HR analytics can face bottlenecks if left unattended or carelessly handled in the initial data collection stage. This can create ethical challenges and break trust in HR analytics itself. However, if the ethical challenges can be overcome, HRA can generate suggestions for the organisation to tackle strategic and tactical challenges related to productivity, innovation, absenteeism, and employee turnover. A 2011 MIT study on HR analytics found that top-performing organisations use analytics five times more than underperforming organisations (LaValle et al., 2011). HR analytics can help organisations deal with a competitive landscape at tactical and strategic levels. At a strategic level, typical competitive challenges faced by organisations include productivity, innovation, global scaling, and lean delivery. At a tactical level, HRA can help HR justify the return on investment in employee training in terms of productivity. 
Furthermore, to ensure a supply of the right talent for the organisation, HRA can help predict which new hires will fit the organisational culture and become leading performers, and which are likely to leave within the next six to twelve months. HRA also lets HR managers work out which employees have the highest potential for progression and leadership. Traditionally, HR managers have applied “gut feelings” or subjective judgements to many people-related challenges. We suggest instead aligning HR functions vertically and horizontally using a solid data-driven approach rooted in HRA (Soundararajan and Singh, 2016). However, HR is unlikely to deliver the right predictive decisions from advanced people
analytics if the questions used to gather employees’ personal and performance data suffer from validity and reliability problems (a simple reliability check is sketched below) or if the data are not sourced ethically. An organisation’s strong commitment to abiding by standard ethical protocols can encourage people to respond accurately on HR-related issues, including job satisfaction and work engagement, which in turn supports time-bound solutions to the problems identified. The following step describes how to remove confirmation and unconscious biases from data analysis and interpretation.
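One widely used reliability check of the kind referred to above is Cronbach’s alpha computed over the items of a scale such as job satisfaction. The sketch below is a minimal illustration in Python; the item names and responses are assumptions made for the example, not data from this chapter.

```python
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha for a set of scale items (columns = items, rows = respondents)."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical responses to a four-item job-satisfaction scale (1-5 Likert).
responses = pd.DataFrame({
    "sat_1": [4, 5, 3, 4, 2, 5],
    "sat_2": [4, 4, 3, 5, 2, 5],
    "sat_3": [3, 5, 2, 4, 1, 4],
    "sat_4": [4, 4, 3, 4, 2, 5],
})

alpha = cronbach_alpha(responses)
print(f"Cronbach's alpha = {alpha:.2f}")  # values above roughly 0.7 are usually considered acceptable
```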
5.3 Removing Confirmation and Unconscious Biases in Data Analysis and Interpretations
Data analysis and data modelling are critical stages in determining how well HRA can be used in HR practices. Any lacunae in the data analysis can lead to incorrect or biased HR outcomes that jeopardise the entire set of HR policies and practices (Deloitte, 2015). Sarker (2021) posits that experimental approaches, analytical models, data-validity measures (for both output and input variables) and dimensions need to be established beforehand, and in an unambiguous way, for robust HRA-driven decision-making. Data of inadequate quality are of little use for HR policy-making; moreover, even good-quality data, if inadequately analysed, will lead to biased decisions (Minbaeva, 2017). HRA has led to unprecedented progress towards data-driven, objective decision-making; however, even ethically sourced data obtained from internal and external sources may not bring optimum benefits if HRA systems suffer from confirmation and unconscious biases (Houser, 2019). Confirmation bias may occur when the data analyst rejects results that contradict their own stereotyped beliefs, values and established hypotheses and accepts only those that support their worldview (Woo et al., 2017). In addition, the choice of wording used to interpret results can be influenced by the data analyst’s personal beliefs and attitudes towards the event or towards others. Take two hypothetical statements: “only 50 per cent of female employees working from home reported that they were happy with their work–life balance” and “a full 50 per cent of female employees working from home said they are happy with their work–life balance”. The former implies that half of female employees are not happy with their work–life balance, while the latter conveys the same finding with a rosier outlook. To avoid this kind of interpretation bias, reports generated by HRA should use a baseline data point so that findings can be compared against the industry average and the organisation’s past track record (Joshi, 2017); a small illustration of such baseline reporting is sketched after this paragraph. The following step sheds more light on how regular algorithm audits carried out by third parties can reduce the likelihood of employee exploitation by HRA.
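A minimal sketch of that baseline-reporting idea, with entirely hypothetical figures, is given below; it simply prints the same statistic against an assumed prior-year result and industry benchmark instead of as a bare percentage.

```python
# Hypothetical figures: current survey result, last year's result, and an industry benchmark.
share_satisfied = 0.50      # share of female employees working from home reporting good work-life balance
previous_year = 0.42
industry_benchmark = 0.46

def framed_report(current: float, baseline: float, label: str) -> str:
    delta = (current - baseline) * 100
    direction = "above" if delta >= 0 else "below"
    return f"{current:.0%} satisfied, {abs(delta):.0f} percentage points {direction} {label}"

# Reporting against explicit baselines avoids the "only 50%" versus "a full 50%" framing problem.
print(framed_report(share_satisfied, previous_year, "last year's internal result"))
print(framed_report(share_satisfied, industry_benchmark, "the industry average"))
```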
5.4 Ongoing Algorithm Audits for HRA Systems and Soliciting Blind CVs
Data- and algorithm-driven HRA systems can enable HR managers to avoid prejudice in critical HR activities such as hiring, performance appraisal and promotion decisions. On many occasions, HRA helps HR managers make objective and impartial decisions that subjective human biases would otherwise influence. However, manipulative algorithms and flawed training data fed into an HR system can result in biased decision-making. A University of Melbourne study examined whether artificial intelligence-enabled HRA discriminates against female job-seekers in the finance industry (University of Melbourne, 2020). The findings of this study revealed that
HRA-generated suggestions favoured males over females in hiring decisions, owing to the use of gender-skewed training data and biased algorithms. One of the HR managers described the finding as there being “something distinct about the men’s CVs that made our panel rank them higher, beyond experience, qualification and education.” Experts suggest that applicants remove demographic data such as race and nationality from their CVs; omitting such demographic data allows HRA to minimise initial screening biases. For example, the BBC and the Guardian have used third parties like GapJumpers to make transparent hiring decisions (The Guardian, 2021). GapJumpers has overhauled hiring by dropping CVs from the process and instead using workplace simulation tests that mimic the challenges of the job itself; interviews are then organised on the basis of the scores on this bespoke test, supporting an objective hiring process. Despite all these proactive approaches to minimising biases, critics argue that the HRA industry is still ill-equipped, since the big data in which HRA is rooted is too large to be captured, stored and managed systematically enough to derive unbiased outcomes (Angrave et al., 2016; Boudreau and Ramstad, 2007; Rasmussen and Ulrich, 2015; Engler, 2021). Although the application of HRA is receiving growing recognition in many organisations, many still experience a shortage of talent able to use HRA systems successfully to generate unbiased results. The scarcity of publicly available case studies demonstrating an organisation’s use of big data in HRA provides little evidence of the strategic use of HRA (Angrave et al., 2016; Rasmussen and Ulrich, 2015; Huselid, 2018). To minimise biased outcomes originating from the improper use of HRA, we suggest that organisations undertake regular audits of their HRA systems, along with the training data and algorithms used to generate HR decisions. Such auditing can minimise biases in HR decisions, to the benefit of both internal employees and external stakeholders (Etukudo, 2019; Engler, 2021). An ongoing fairness check of HRA systems can uncover the sources of underlying biases related to people, training data or algorithms. Regular auditing can help HR departments generate authentic organisational decisions, thus promoting meritocracy over favouritism in all HR functions. Such merit-based HR practices can send positive signals to employees and protect employers from potential lawsuits for non-compliance with local laws such as equal employment opportunity legislation.
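As one concrete form such a fairness check could take, the sketch below computes shortlisting rates by gender from a hypothetical screening log and compares them using the widely cited “four-fifths” rule of thumb for adverse impact. The data and column names are illustrative assumptions, not the audit procedure of GapJumpers or any organisation mentioned above.

```python
import pandas as pd

# Hypothetical screening log produced by an HRA shortlisting system.
decisions = pd.DataFrame({
    "gender":      ["F", "F", "F", "F", "M", "M", "M", "M", "M", "M"],
    "shortlisted": [ 1,   0,   0,   1,   1,   1,   0,   1,   1,   0 ],
})

# Selection rate per group.
rates = decisions.groupby("gender")["shortlisted"].mean()

# Adverse-impact ratio: rate of the least-favoured group over the most-favoured group.
impact_ratio = rates.min() / rates.max()

print(rates)
print(f"Adverse-impact ratio = {impact_ratio:.2f}")
if impact_ratio < 0.8:  # the "four-fifths" rule of thumb used in employment-selection audits
    print("Warning: selection rates differ enough to warrant closer review of the training data and algorithm.")
```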
6. CONCLUSION

Many hope that algorithms will help human decision-makers avoid prejudice by bringing fairness and consistency to the hiring process. Data-leveraged HRA can help identify and remove implicit human biases in decision-making by reducing the likelihood of subjective interpretations of human decisions. However, it remains questionable whether HRA can eliminate human bias. An array of evidence indicates that the indiscriminate adoption of AI-embedded HRA can give rise to new kinds of biased decisions, posing numerous ethical and legal challenges. Biases can permeate HRA at several stages, for example through faulty or incomplete training data and through human coding errors in algorithms. In many cases, unintentional HR decisions can disadvantage certain groups, such as ethnic minorities and women. We all naturally possess some biases; however, we can minimise many of our subconscious biases by adhering to ethical integrity and professional standards, supported by regular algorithm auditing. We suggest that effective implementation
of HRA in the organisation needs to be supported by all organisational members, led by top management, to ensure that HR practices such as recruitment, selection, performance appraisal, training and development, and promotion are oriented towards well-being.
Table 12.1  Proactive and reactive approaches to minimise HRA biases

1. Proactive: Create an HRA-supportive organisational culture, initiated and supported by top management. Reactive: Perform regular independent audits to detect biases in the training data and algorithms.
2. Proactive: Ethically collect reliable and valid data from both internal and external sources to feed into HRA. Reactive: Review data collection instruments to ensure their validity and reliability.
3. Proactive: Remove confirmation biases and introduce non-manipulative training data and algorithms in HRA. Reactive: Be open to taking feedback from all stakeholders, including key organisational members, on how to improve HRA systems regularly.
4. Proactive: Remove personal stereotypes and unconscious biases in data interpretation. Reactive: Communicate review outcomes with all stakeholders and reflect on outcomes.
5. Proactive: Act based on authentic suggestions generated by HRA. Reactive: Implement suggested systems improvements for broader acceptance of HRA among key stakeholders.
In conclusion, merit-based supportive HR practice in the organisation is a win–win case for both the organisation and its members to grow and survive in the competitive world. Based on the extant research review, we suggest that the above proactive and reactive approaches presented in Table 12.1 should be used as a set of guidelines to minimise biases in HRA systems.
REFERENCES ABC (2019). They resent the fact I’m not a robot. Available at https://www.abc.net.au/news/2019-02-27/ amazon-australia-warehouse-working-conditions/10807308. Akter, S. and Wamba, S.F. (2016). Big data analytics in E-commerce: A systematic review and agenda for future research. Electronic Markets, 26(2), 173–94. Akter, S., Wamba, S.F., Gunasekaran, A., Dubey, R. and Childe, S.J. (2016). How to improve firm performance using big data analytics capability and business strategy alignment? International Journal of Production Economics, 182, 113–31. Akter, S., Motamarri, S., Hani, U., Shams, R., Fernando, M., Babu, M.M. and Shen, K.N. (2020). Building dynamic service analytics capabilities for the digital marketplace. Journal of Business Research, 118, 177–88. Andersen, M.K. (2017). Human capital analytics: The winding road. Journal of Organizational Effectiveness: People and Performance, 4(2), 133–36. Angrave, D., Charlwood, A., Kirkpatrick, I., Lawrence, M. and Stuart, M. (2016). HR and analytics: Why HR is set to fail the big data challenge. Human Resource Management Journal, 26(1), 1–11. Aydiner, A.S., Tatoglu, E., Bayraktar, E., Zaim, S. and Delen, D. (2019). Business analytics and firm performance: The mediating role of business process performance. Journal of business research, 96, 228–37. Bakhru, K.M. and Sharma, A. (2022). Unlocking drivers for employee engagement through human resource analytics. In Research Anthology on Human Resource Practices for the Modern Workforce (pp. 471–90), IGI Global. Barney, Jay (1991). Firm resources and sustained competitive advantage. Journal of Management, 17(1), 99–120. Barrales-Molina, V., Montes, F.J.L. and Gutierrez-Gutierrez, L.J. (2015). Dynamic capabilities, human resources and operating routines: A new product development approach. Industrial Management & Data Systems, 115(8), 1388–411.
How HR analytics can leverage big data to minimise employees’ exploitation 191 Barrett, M., Davidson, E., Prabhu, J. and Vargo, S.L. (2015). Service innovation in the digital age. MIS Quarterly, 39(1), 135–54. Bell, E., Bryman, A. and Harley, B. (2018). Business Research Methods. Oxford: Oxford University Press. Ben-Gal, H.C. (2019). An ROI-based review of HR analytics: practical implementation tools. Personnel Review, 48(6), 1429–48. Bock, L. (2015). Work Rules!: Insights from Inside Google that will Transform how you Live and Lead. New York: Twelve Books. Bogen, M. (2019). All the ways hiring algorithms can introduce bias. Harvard Business Review, 6 May. Available at https://hbr.org/2019/05/all-the-ways-hiring-algorithms-can-introduce-bias. Boselie, P. (2014). Strategic Human Resource Management: A Balanced Approach, 2nd edn. Maidenhead: McGraw-Hill Education. Boudreau, J. (2017). HR must make people analytics more user-friendly. Harvard Business Review, June, pp. 1–5. Boudreau, J. and Cascio, W. (2017). Human capital analytics: Why are we not there? Journal of Organizational Effectiveness: People and Performance, 4(2), 119–26. Boudreau, J. and Ramstad, P. (2007). Beyond HR: The New Science of Human Capital. Boston, MA: HBR Press. Buengeler, C., Leroy, H. and De Stobbeleir, K. (2018). How leaders shape the impact of HR’s diversity practices on employee inclusion. Human Resource Management Review, 28(3), 289–303. Caudron, S. (2004). Jac Fitz-enz, metrics maverick. Workforce.com. CIPD (2021). People Analytics. Accessed at People Analytics/Factsheets/CIPD. Collins, M. (2013). Change your company with better HR analytics. Harvard Business Review, December. Creswell, J. (2008). Research Design: Qualitative, Quantitative, and Mixed Methods Approaches. Thousand Oaks, CA: Sage. Dahlbom, P., Siikanen, N., Sajasalo, P. and Jarvenpää, M. (2020). Big data and HR analytics in the digital era. Baltic Journal of Management, 15(1), 120–38. Available at https://doi.org/10.1108/BJM -11-2018-0393. Davenport, T.H. (2018). From analytics to artificial intelligence. Journal of Business Analytics, 1(2), 73–80. Davenport, T.H., Harris, J. and Shapiro, J. (2010). Competing on talent analytics. Harvard Business Review, 88(10), 52–8. Deloitte (2015). HR and people analytics. Stuck in neutral. Available at https://www2.deloitte.com/ us/en/insights/focus/human-capital-trends/2015/people-and-hr-analytics-human-capital-trends-2015 .html. Doellgast, V. and Marsden, D. (2019). Institutions as constraints and resources: Explaining cross‐national divergence in performance management. Human Resource Management Journal, 29(2), 199–216. Elo, S., Kaarianinen, M., Kanste, O., Polkki, R., Utriainen, K. and Kyngas, H. (2014). Qualitative Content Analysis: A focus on trustworthiness. Sage Open, 4, 1–10. Engler, A. (2021). Auditing employment algorithms for discrimination. Brookings Institute, Center for Technology Innovation. Available at https://www.brookings.edu/research/auditing-employment -algorithms-for-discrimination/. Eriksson, P. and Kovalainen, A. (2015). Qualitative Methods in Business Research, 2nd edn. London: Sage Publications. Erlingsson, C. and Brysiewic, P. (2017). A hands-on guide to doing content analysis. African Journal of Emergency Medicine, 7(3), 93–9. Etukudo, R. (2019). Strategies for using analytics to improve Human Resource Management Doctoral dissertation, Walden University. Fitz-Enz, J. (2009). Predicting people: From metrics to analytics. Employment Relations Today, 36(3), 1–11. 
Giermindl, L.M., Strich, F., Christ, O., Leicht-Deobald, U. and Redzepi, A. (2022). The dark sides of people analytics: Reviewing the perils for organisations and employees. European Journal of Information Systems, 31(3), 410–35.
192 Handbook of big data research methods Guest, D.E. (2017). Human resource management and employee well-being: Towards a new analytic framework. Human Resource Management Journal, 27(1), 22–38. Guru, K., Raja, S., Umadevi, A., Ashok, M. and Ramasamy, K. (2021). Modern approaches in HR analytics towards predictive decision-making for competitive advantage. In Artificial Intelligence, Machine Learning, and Data Science Technologies (pp. 249–67). CRC Press. Hajkowicz, S. (2015). Global Megatrends: Seven Patterns of Change Shaping Our Future. Csiro Publishing. Hamilton, R.H. and Sodeman, W.A. (2020). The questions we ask: Opportunities and challenges for using big data analytics to strategically manage human capital resources. Business Horizons, 63(1), 85–95. Hashem, I.A.T., Yaqoob, I., Anuar, N.B., Mokhtar, S., Gani, A. and Khan, S.U. (2015). The rise of “big data” on cloud computing: Review and open research issues. Information Systems, 47, 98–115. Houser, K.A. (2019). Can AI solve the diversity problem in the tech industry: Mitigating noise and bias in employment decision-making. Stanford Technology Law Review, 22, 290. Hsieh, H.F. and Shannon, S.E. (2005). Three approaches to qualitative content analysis. Qualitative Health Research, 15(9), 1277–88. Huselid, M.A. (2018). The science and practice of workforce analytics: Introduction to the HRM special issue. Human Resource Management, 57(3), 679–84. IBM (2021). Big Data Analytics. Available at Big Data Analytics/IBM. Johansson, J. and Herranen, S. (2019). The application of Artificial Intelligence (AI) in Human Resources Management. Business Administration thesis, Jonkoping University, Sweden. Joshi, N. (2017). How to avoid bias in data analytics. HRM. Available at https://www.hrmonline.com.au/ technology/avoid-bias-data-analytics/. Kang, S.K., DeCelles, K.A., Tilcsik, A. and Jun, S. (2016). Whitened résumés: Race and self-presentation in the labor market. Administrative Science Quarterly, 61(3), 469–502. Available at https://doi.org/10 .1177/0001839216639577. Kaplan, M., Lawson, K.A. and McCrady, V. (2018). The hero viewpoint and the perception of mentors: Why Millennials need mentors and why they have problems finding them. Academy of Business Research Journal, 2, 36–52. Khan, S.A. and Tang, J. (2016). The paradox of human resource analytics: Being mindful of employees. Journal of General Management, 42(2), 57–66. Köchling, A. and Wehner, M.C. (2020). Discriminated by an algorithm: A systematic review of discrimination and fairness by algorithmic decision-making in the context of HR recruitment and HR development. Business Research, 1–54. Kryscynski, D., Reeves, C., Stice‐Lusvardi, R., Ulrich, M. and Russell, G. (2018). Analytical abilities and the performance of HR professionals. Human Resource Management, 57(3), 715–38. Kutik, B. (2014). Predictive analytics dominates my first HR Conference. Human Resource Executive Online. Available at http://www.hreonline.com/HRE/view/story.jhtml?id=. LaValle, S., Lesser, E., Shockley, R., Hopkins, M.S. and Kruschwitz, N. (2011). Big data, analytics and the path from insights to value. MIT Sloan Management Review, 52(2), 21–32. Lawler, E., Levenson, A. and Boudreau, J.W. (2004). HR metrics and analytics: Use and impact. Human Resource Planning, 27(4), 27–35. Lengnick-Hall, M.L., Neely, A.R. and Stone, C.B. (2018). Human resource management in the digital age: Big data, HR analytics and artificial intelligence. In P. Novo Melo and C. 
Machado (eds), Management and Technological Challenges in the Digital Age (pp. 1–30). CRC Press. Maity, S. (2019). Identifying opportunities for artificial intelligence in the evolution of training and development practices. Journal of Management Development, 38(8), 651–63. Available at https://doi .org/10.1108/JMD-03-2019-0069. Marler, J.H. and Boudreau, J.W. (2017). An evidence-based review of HR analytics. International Journal of Human Resources Management, 28(1), 3–26. Mayer-Schönberger, V. and Cukier, K. (2014). Big Data – A Revolution That Will Transform How We Live, Work and Think. New York: Houghton Mifflin Harcourt. Mazzei, M.J. and Noble, D. (2019). Big Data and strategy: Theoretical foundations and new opportunities. In B. Orlando (ed.), Strategy and Behaviors in the Digital Economy. London: IntechOpen. DOI: 10.5772/intechopen.84819.
How HR analytics can leverage big data to minimise employees’ exploitation 193 McAfee, A. and Brynjolfsson, E. (2012). Big Data: The management revolution: Exploiting vast new flows of information can radically improve your company’s performance. But first you’ll have to change your decision-making culture. Harvard Business Review, October. McKinsey & Company (2020). Diversity wins: How inclusion matters. Available at https:// www .mckinsey.com/featured-insights/diversity-and-inclusion/diversity-wins-how-inclusion-matters. Mikalef, P., Boura, M., Lekakos, G. and Krogstie, J. (2019). Big data analytics capabilities and innovation: The mediating role of dynamic capabilities and moderating effect of the environment. British Journal of Management, 30(2), 272–98. Minbaeva, D.B. (2017). Building credible human capital analytics for organisational competitive advantage, Human Resource Management, 57(3), 701–13. Mujtaba, D.F. and Mahapatra, N.R. (2019). Ethical considerations in AI-based recruitment. 2019 IEEE International Symposium on Technology and Society (ISTAS), Medford, MA, 1–7. Available at doi: 10.1109/ISTAS48451.2019.8937920. Ngai, E.W., Gunasekaran, A., Wamba, S.F., Akter, S. and Dubey, R. (2017). Big data analytics in electronic markets. Electronic Markets, 27(3), 243–5. O’Neill, E. (2016). 10 companies that are using big data. ICAS.com. Available at https://www.icas .com/thought-leadership/technology/10-companies-using-big-data#:~:text=1%20Amazon.%20The %20online%20retail%20giant%20has%20access,Capital%20One.%20Marketing%20is%20one %20of%20the%20. Prassl, J. (2018). Humans as a Service: The Promise and Perils of Work in the Gig Economy. Oxford: Oxford University Press. Rasmussen, T. and Ulrich, D. (2015). Learning from practice: How HR analytics avoids becoming a fad. Organizational Dynamics, 44(3), 236–42. Raub, M. (2018). Bots, bias and big data: Artificial intelligence, algorithmic bias and disparate impact liability in hiring practices. Arkansas Law Review, 71, 529. Reuters (2018). Amazon scraps secret AI recruiting tool that showed bias against women. Available at https://www.reuters.com/article/us-amazon-com-jobs-automation-insight-idUSKCN1MK08G. Rex, T., Bhattacharya, S., Narayanan, K. and Budhwar, P. (2020). Opportunities and barriers in the practice of human resource analytics. In P. Kumar, A. Agrawal and P. Budhwar (eds), Human & Technological Resource Management (HTRM): New Insights into Revolution 4.0. Bingley: Emerald Publishing. Sarker, I.H. (2021). Data science and analytics: An overview from data-driven smart computing, decision-making and applications perspective. SN Computer Science, 2, 377. Available at https://doi .org/10.1007/s42979-021-00765-8. Saunders, M. and Townsend, K. (2016). Reporting and justifying the number of interview participants in organization and workplace research. British Journal of Management, 27, 836–52. Available at https://doi.org/10.1111/1467-8551.12182/ Scullen, S.E., Mount, M.K. and Goff, M. (2000). Understanding the latent structure of job performance ratings. Journal of Applied Psychology, 85(6), 956–70. Shet, S.V., Patil, S.V. and Chandawarkar, M.R. (2019). Competency based superior performance and organisational effectiveness. International Journal of Productivity and Performance Management, 68(4), 753–73. Shet, S.V., Poddar, T., Samuel, F.W. and Dwivedi, Y.K. (2021). Examining the determinants of successful adoption of data analytics in human resource management: A framework for implications. 
Journal of Business Research, 131, 311–26. Snell, A. (2006). Researching onboarding best practice: Using research to connect onboarding processes with employee satisfaction. Strategic HR Review, 5(6), 32–5. Available at https://doi.org/10.1108/ 14754390680000925. Soundararajan, R. and Singh, K. (2016). Winning on HR Analytics: Leveraging Data for Competitive Advantage. New Delhi: Sage Publications. Sparrow, P., Hird, M. and Cooper, C. (2015). Do We Need HR? Repositioning People Management for Success. Basingstoke: Palgrave Macmillan. Tambe, P., Cappelli, P. and Yakubovich, V. (2019). Artificial intelligence in human resources management: Challenges and a path forward. California Management Review, 61(4), 15–42.
194 Handbook of big data research methods Teece, D.J., Pisano, G. and Shuen, A. (1997). Dynamic capabilities and strategic management. Strategic Management Journal, 18(7), 509–33. The Guardian (2021). Unconscious bias training alone will not stop discrimination, say critics. The Guardian, 2 March. Available at https://www.theguardian.com/money/2021/mar/02/unconscious -bias-training-alone-will-not-stop-discrimination-say-critics. University of Melbourne (2020). Entry barriers for women are amplified by AI in recruitment algorithms, study finds. Available at https://about.unimelb.edu.au/newsroom/news/2020/december/entry-barriers -for-women-are-amplified-by-ai-in-recruitment-algorithms,-study-finds. Vihari, N.S. and Rao, M.K. (2013). Analytics as a predictor for strategic and sustainable human resources function: An integrative literature review. Doctoral dissertation, IIT Roorkee, Roorkee. Wamba, S.F., Gunasekaran, A., Akter, S., Ren, S.J.F., Dubey, R. and Childe, S.J. (2017). Big data analytics and firm performance: Effects of dynamic capabilities. Journal of Business Research, 70, 356–65. Waters, S.D., Streets, V.N., McFarlane, L.A. and Johnson-Murray, R. (2018). The Practical Guide to HR Analytics: Using Data to Inform, Transform, and Empower HR Decisions. Society for Human Resource Management. Woo, S.E., O’Boyle, E.H. and Spector, P.E. (2017). Best practices in developing, conducting, and evaluating inductive research. Human Resource Management Review, 27(2), 255–64. Yam, J. and Skorburg, J.A. (2021). From human resources to human rights: Impact assessments for hiring algorithms. Ethics and Information Technology, 23(4), 611–23.
13. Embracing Data-Driven Analytics (DDA) in human resource management to measure the organization performance P.S. Varsha and S. Nithya Shree
INTRODUCTION HR analytics has become a more popular trend in the field of the human resources (HR) domain (King, 2016; Marler and Boudreau, 2017; Van den Heuvel and Bondarouk, 2017; Huselid, 2018; Kryscynski et al., 2018; McIver et al., 2018; Tursunbayeva et al., 2018; Ben-Gal, 2019) and has been labelled a gamechanger for the HR department to enable fact-based decision making to enhance business output (Marler and Boudreau, 2017; Van der Togt and Rasmussen, 2017). In addition, the penetration of big data analytics is able to redefine their process in the various domains to achieve a competitive advantage in the organization (Akter et al., 2020). Subsequently, the people management career is a hot topic of discussion among scholars, industry practitioners, idealistic leaders and technology experts who have the potential to know the significance of analytics in various human resource (HR) functions (Greasley and Thomas, 2020). HR analytics is also referred to as workforce, human capital, or people analytics (Bassi, 2011; Boudreau and Cascio, 2017; Falletta, 2014; Guzzo et al., 2015; Strohmeier and Piazza, 2015; Huselid, 2018; Marler and Boudreau, 2017). In anticipation of the upcoming era, innovation can help to transform HR practices in the present context (Ulrich, 1996). Several transitions have happened in the data world due to new emerging technologies and the availability of human resource (HR) information (Van den Heuvel and Bondarouk, 2017). An increase in digitization with the advancement of new information technology like human resource information systems (HRIS), artificial intelligence (AI) and machine learning (ML) tools offered to HR professionals to provide data-driven analytics (DDA) decisions in firms is helping to empower strategies, increase cross-culture diversity and organization success (McCartney and Fu, 2022). Several years ago, organizations began to leverage workforce-related data to draw strategic decisions using DDA, and it has grown significantly. In addition, the use of HR analytics has increased in companies to enrich the organizations’ performance (McCartney et al., 2020). Thus, HR professionals apply the various tools and analyse employee data for their performance outcomes (McIver et al., 2018). To give an example, Google started to collect and analyse candidate data to make DDA decisions on recruitment and selection (Harris et al., 2011; Shrivastava et al., 2018). The Bank of America, in collaboration with Humanyze, analyses the data from various sources such as cell phones, emails, voice/image recognition, social media posts, and sensors to make strategic decisions at the workplace (Kane et al., 2015). In addition, the Shell oil and gas organization developed their human resource analytics abilities through data analytics by recruiting employees having a wide array of knowledge in statistics, applied mathematics, psychology and behavioural economics who can narrate the storytelling 195
196 Handbook of big data research methods through data (Van der Togt and Rasmussen, 2017). Hence in the digital age, the demand for HR professionals eventually increased. Subsequently, the data visualization or digitalization of human resource management operations brings new opportunities for people management from historical to streaming data generated through human resource information systems (HRIS) (Stone et al., 2015; Van den Heuvel and Bonarouk, 2017). Hence there is a wide range of data sources that can be evaluated to consider decisions on organizational and individual outcomes (Fabbri et al., 2019). More frequently, decision-making has been noted as one of the critical components in the workforce process comprising employee performance, behaviour, motivation and stress management (Griffin and Moorhead, 2011). It’s quite crucial for HR practices to be able to implement, align and sync with larger expectations of company objectives. For instance, the routine behaviour of employee skills, knowledge, and diversified competitive business strategies can be closely related to decision-making, enhancing organization performance (Mohammed and Quddus, 2019). Deploying new-age technology born as HR analytics in order to address the HR operations has made a significant impact on individual and organizational performance (Sharma and Sharma, 2017). The term “HR analytics” is the new boon that has a different meaning in different contexts (Bassi, 2011). Van den Heuvel and Bondarouk (2017) revealed in their study that human resource analytics (HRA) is a systematic procedure to quantify people driving a process to decision outcomes by using data. In a similar vein, HR analytics is data-driven decision-making by using analytics to achieve competitive advantage. People analytics is the popular buzzword across the globe in the HR domain. By observing the prodigy of HRA, several researchers and practitioners are deploying this mechanism and noting it too (Platanou and Makela, 2016). However, despite HRA becoming mainstream in the corporates for evidence-based decisions to improve firms’ performance, present studies of HR analytics are sparse, in the ideation phase or still underdeveloped (McCartney and Fu, 2022). Furthermore, not much is known about the HR process and capabilities to make smart decisions (Falletta and Combs, 2020). At the professional level, there is even less academic research in HR analytics (Ben-Gal, 2019; Marler and Boudreau, 2017) and HR analytics perspectives are being condemned as a fad (Ben-Gal, 2019; Marler and Boudreau, 2017; Qureshi, 2020; Rasmussen and Ulrich, 2015). Subsequently, studies examining how and to what extent HR analytics affects and influences firms’ performance give very minimal information (Huselid, 2018; Minbaeva, 2018). To fill these voids our study addresses the following research questions: How did HR analytics evolve and how has it benefited firms? R1: R2: How do HRM functions use the information for data-driven analytics(DDA) to make decisions through visualization? To answer these research questions, first, this study identifies contributions discussing the HR analytics evolution, definitions and terminologies as part of a literature review. Second, the study proposes a framework by identifying the factors which emphasize the significance of HR analytics. Third, the research methodology will address the second research question and the data collection process. 
Fourth, research findings, theoretical implications, managerial implications, limitations and directions for future research are discussed.
R1: How did HR analytics evolve and how has it benefited firms?
The literature review below traces the evolution of HR analytics from 1980 to the present day through several studies. These studies help to identify the factors from which we propose a conceptual framework for firms’ growth and benefit.
LITERATURE REVIEW Evolution of HR Analytics Year 1980: in the 1980s the first automation began with HRM processes such as payroll and data administration to attract researchers, to address and test the factors which affect the adoption of Human Resource Information Systems (HRIS). This shows HR practices to be automated (DeSanctis, 1986; Mathys and LaVan, 1982; Lederer, 1984; Magnus and Grossman, 1985; Taylor and Davis, 1989). This study shows that academicians and industry experts know about the advancement of technologies in HRIS for organizations to generate reports. However, there is limited proficiency to deploy this process in both academicians and industry practitioners. Year 1990: during the 1990s, there were rapid developments in both academia and business. At this time, use of the information system was limited; however, there is evidence of an increase in the practice of HRIS and in companies able to exhibit the technology by implementing computer systems in HRM (Kossek et al.,1994; Mathieson, 1993; Hannon et al., 1996; Haines and Petit, 1997). This decade saw only 1 per cent of global communications being handled by the internet in the year 1993, reaching 51 per cent by 2000, and more than 97 per cent by 2007 (Hilbert and Lopez, 2011). In this period, the transition focus was on the people as an asset of the organization and their capability to create competitive advantage (Pfeffer, 1994; Ulrich, 1996; Wright et al., 2001). Thus, human capital become a buzzword in both academia and industry (Edvinsson, 1997). Year 2000: in this period, much attention was given to developing new techniques to measure human and intellectual capital (Bontis and Fitz-Enz, 2002). During the first half of the 2000s, novel ideas like HR and workforce scorecards were developed (Huselid et al., 2005; Ulrich and Beatty, 2001). These tools benefited the organizations in measuring the intensive effect of HR activities and practices on company performance. During the mid-2000s, more scientific and evidence-based approaches to HR became more popular and were implemented in the firms (Boudreau and Ramstad, 2007; Pfeffer and Sutton, 2006; Rynes et al., 2007). Year 2010 onwards: the previous works revealed that HRA has existed as a research subject for about 15 years (Angrave et al., 2016). Subsequently, HRA gained momentum and was a topic for conversation in journals mainly focused on HR and people strategy (Fink, 2010; Levenson, 2005; 2011; Waber, 2013). In the last few years, HRA has received considerable attention in knowledge outlets such as the Harvard Business Review, and several reports developed by global consulting and technology giants (Madsen and Slatten, 2017). Lastly, the topic of HRA is presently much debated in most of the HR literature (Rasmussen and Ulrich, 2015; Ulrich and Dulebohn, 2015). Currently, the main perspective of the research lens on HRA is how to use analytics as a decision support to predict the future (Fitz-Enz and
198 Handbook of big data research methods Mattox, 2014; Van den Heuvel and Bondarouk, 2017). Furthermore, the evidence shows that using big data in human resource functions supports HR-related strategic decision-making processes (Angrave et al., 2016; Shah et al., 2017). HR Analytics Definition and Terminologies HR analytics is a mechanism to understand the impact of HR practices and policies to increase organization performance. The organization’s effectiveness can be evaluated by statistical techniques and experimental approaches used to validate the causal relationship of HR variables such as performance metrics on employee working, the profit of each project, employee training, and so on (Lawler et al., 2004). This is an evidence-based approach to making decisions on employee data showcasing HR metrics on predictive modelling (Bassi, 2011). HR analytics is also referred to to demonstrate the role of employee data in improving business results (Mondore et al., 2011). HR intelligence and analytics are defined as the complete information technology-based process to provide employee information (Strohmeier and Piazza, 2015). It’s a systematic way to identify and quantify human data for business outcomes for better decisions (Van den Heuvel and Bondarouk, 2017). People analytics is a new area of HRM practice that includes research and innovation using digital technologies and data analytics generating meaningful insights about workforce dynamics, human capital, and individual or team performance through data visualization. Managers may use people analytics to improve a firm’s productivity, performance, and employee experience (Tursunbayeva et al., 2018). Workforce analytics is a process that starts to understand, quantify, manage and improve the purpose of talent in the execution of strategy and value creation. It provides a deep dive into metrics and analytics (Huselid, 2018). It’s a proactive and systematic procedure to gather, analyse and communicate, and evidence-based research and analytical outcomes then lead to data-driven decisions (Falletta and Combs, 2020). Lastly, Table 13.1 summarizes the several definitions and labels associated with HR analytics to provide strategic decisions to firms. Table 13.1
Definition of HR analytics
HR Analytics (Angrave et al., 2016). Definition: requires complex and composite projects able to frame questions, research design, data collected from firms, and statistical and econometric models incorporated at various levels to understand the complexity as a solution for future management actions. Attributes: decision-making; statistical and econometric models; data analysis.

HR Analytics (Levenson and Fink, 2017). Definition: defined as the wide array of measurement and analytical outlooks; it is distinct from HR and influences the procedure for addressing how firms can improve return on investment. Attributes: process to address; quantify ROI.

HR Analytics (Marler and Boudreau, 2017). Definition: in the HR domain, information technology uses descriptive analytics and visualization to analyse data relevant to the HR process, human capital, firm performance and external economic standards to create business impact in firms through data-driven decision-making. Attributes: statistical analysis; decisions; technology.

People Analytics (Nielsen and McCullough, 2018). Definition: utilizes human data such as people’s behaviour, relationships and traits to make decisions through data analysis, minimizing risk and predicting future business outcomes. Attributes: decision-making; data analysis; prediction.

People Analytics (Gittell and Ali, 2021). Definition: the implementation of predictive analysis and modelling, big data and artificial intelligence in people management. Attributes: prediction; statistical analysis; big data; AI.

Workforce Analytics (McIver et al., 2018). Definition: a continuous process able to solve problems through a specific research approach, in which data analysis with technology supports firms’ decision-making. Attributes: decision-making; data analysis; technology.
HRA FRAMEWORK

As mentioned, HR analytics is a nascent domain and its distinctive contribution to organizations is remarkable. The authors identified the factors enabling HR analytics to increase the efficiency of organizations and developed the framework shown in Figure 13.1.
Figure 13.1  Human Resource Analytics (HRA) conceptual framework
HRA Ontogeny The insight of analytics in the human resource management area is known to academicians and industry experts in the big data world. The objective of people analytics is to collect and maintain data to forecast long and short periods in demand and supply regarding employees in various industries and occupations at the global level in order to make decisions related to acquisitions, growth and retention strategies (Kapoor and Sherif, 2012). This analytics provides the organization with data-driven insights to manage employees to achieve business
200 Handbook of big data research methods tasks efficiently (Davenport et al., 2010; Hota and Ghosh, 2013). The analytics in people management helps to make strategies in the organization (Van Heuvel and Bondarouk, 2017; Ben-Gal, 2019; Kapoor and Sherif, 2012; Levenson, 2005; Levenson, 2011). Sapience Analytics HR Professionals HR professionals are aware of data literacy in the workforce and specific tools are used in predictive and prescriptive data analysis. Data analysis is able to analyse employee demographics, skills, competencies and performance. It helps managers to select the right talent and to minimize risks (Patre, 2016). It’s a strategic management tool that yields return on investment (ROI) in the organization for its development (Delbridge and Barton, 2002). The HR person understands the organization’s long-term goals that impact analytics in the business (Welbourne, 2015; Ben-Gal, 2019). HRA Realm Google is one of the top global IT giants in the world supporting innovative work by incorporating advanced analytics in HR practices (Shrivastava et al., 2018). People analytics is one of the main domains for capturing employee data and now it is a base for people-centric decisions (Talent Management and HR, 2014). HR analytics helps to increase employee productivity and retain talent (Boakye and Lamptey, 2020). Workforce analytics creates value creation in leadership, team competence, safety, operational efficiency, training and customer satisfaction (Rasmussen and Ulrich, 2015). A few studies affirmed that human capital analytics helps to increase efficiency and data transparency in recruitment, training, performance management and retention, enriching managerial decision making (Nagpal and Mishra, 2021). HRA Gauge People analytics is used to measure the organization’s performance to solve certain managerial issues based on employee competency to attain better decision-making (Ryan, 2020). In addition, analytics gives novel insight to managers so that they can control the organization’s culture and increase productivity (Corritore et al., 2020). It is one of the key tools in talent management, applying predictive analytics to establish joining dates/delay, selection and offer acceptance (Srivastava et al., 2015). HRA Experience The buzz created around automation in the analytics phenomenon was aided by technology, which is gaining more momentum in the workplace (Fernandez, 2019). HR analytics is important to managers in making decisions based on data. Managers get experience with machine learning with automation which helps to bring new hires, retain employees, and in promotion and training programmes, helping the firm to grow (Fernandez, 2019). However, in developing countries, HR people are still in the intermediate phase in data-driven decision-making.
Embracing DDA in human resource management 201 Embracing HRA Success People management is considered to be a barometer for the data analytics crusade where HR professionals are ready to adopt analytics in the organization (SHRM Foundation, 2016). It involves the analysis of HR data which includes data integration from internal and external functions of the firm with the help of HRA tools. Lastly, all of this data analysis is used to support HR-related decision-making in organizations about business outcomes and organizational performance. To take an example, Deloitte’s survey of 10 447 business and HR leaders discovered that although 71 per cent of companies maintain that deploying workforce analytics is vital for business performance, only 15 per cent used it to create diversity in the workforce by creating a talent scorecard for managers (Walsh and Volini, 2017). The rise of data analytics in organizations is slowly enabling decisions made from interesting statistics to have a strategic impact (Mclver et al., 2018). These analytics are aligned to enhance the data-driven strategy outcomes and organization efficiency (Levenson, 2018). HR Metrics Metrics represent the data in numbers, which mirrors the descriptive way of narrating results such as the success of a recruitment drive (Carlson and Kavanagh, 2011). Several HR metrics (revenue of employee, cost per hire, absentee rate, workers’ compensation, and so on) are more often used in companies, which can be tracked by the novel work of Dr Jac Fitz-Enz carried out at the Saratoga Institute (Carlson and Kavanagh, 2011). Lawler et al. (2004) revealed in their research what differentiates HR analytics from HR metrics. HR metrics are used to measure the key people management outcomes of efficiency, impact, performance and effectiveness. HR analytics cannot measure the outcomes, but showcases statistical techniques and experimental/ evidence-based approaches to manifest the impact on HR activities (Levenson et al., 2005). Thus HR analytics is a proof-based approach to making better decisions on the human side, which includes the group of tools and technologies ranging from simple reporting of HR metrics to predictive modelling (Bassi, 2011). Finally, to sum up, the use of HR metrics in HR analytics helps to make evidence-based decisions (Edwards, 2018). HRA Efficiency Roadmap In general, HR analytics will analyse the data to make better decisions on organizations’ productivity, employee engagement, hiring, retention, training and performance management (McCartney et al., 2020). The important HR analytics tools are R, Excel, SPSS, Power BI (tableau to develop dashboards for recruitment cycle, business analysis, and so on), Zoho People (software to manage workforce and employee-related issues), People Inside (cloud-based analytics to make business decisions through statistics), Crunchhr (cloud analytics to improve the productivity of work), BambooHR (employee benefits tracker app) and IBM Kenexa (Employee hiring and retention) used to attain success in organizations (Wandhe, 2020). Hence in today’s data-driven world, HR analytics quantifies the work, deploys statistics models/ machine learning operations/ automation, and increases firms’ performance (Wandhe, 2020).
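As a hedged illustration of the kind of predictive modelling these tools support, the sketch below, written in Python (a general-purpose option alongside the R mentioned above, rather than any named vendor product), fits a simple employee-turnover model; the column names, figures and employees are assumptions made for the example.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical historical records pulled from an HRIS (column names are assumptions).
history = pd.DataFrame({
    "tenure_years":     [0.5, 1.0, 2.0, 3.5, 4.0, 5.0, 6.5, 7.0, 8.0, 10.0, 1.5, 2.5],
    "engagement_score": [2.1, 3.0, 2.5, 3.8, 4.0, 3.5, 4.2, 3.9, 4.5, 4.8, 2.0, 2.8],
    "absence_days":     [12, 8, 10, 4, 3, 5, 2, 3, 1, 1, 14, 9],
    "left_within_year": [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1],
})

features = ["tenure_years", "engagement_score", "absence_days"]
model = LogisticRegression(max_iter=1000).fit(history[features], history["left_within_year"])

# Score two hypothetical current employees by predicted probability of leaving within a year.
current = pd.DataFrame({"tenure_years": [1.0, 6.0],
                        "engagement_score": [2.3, 4.1],
                        "absence_days": [11, 2]})
risk = model.predict_proba(current[features])[:, 1]
print(dict(zip(["employee_A", "employee_B"], risk.round(2))))
```

In practice such risk scores would be validated on held-out data and reviewed for bias before informing any retention decision.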
202 Handbook of big data research methods HRA Applications HR analytics is a new concept in the management domain of business applications (Baesens et al., 2017). The application of AI and data analytics in people management functions provides the solution for gender bias during the recruitment process and in job portals (Rangaiah, 2020). The hallmark of HR analytics is that it strengthens organization performance. At Accenture and Oracle, studies revealed that 58 per cent of employees stated that they were all aware that the talent cloud improves profitability and also 60 per cent of executives conveyed that it increases revenue per employee (Oracle and Accenture, 2014). At Google, the “Project Oxygen” initiative uses people analytics to rank leadership traits for managers, which include comments, behaviour, performance and employee surveys (Shrivastava et al., 2018). Automation in HR Today’s AI-driven world is applied to all domains including HRM, specifically in recruitment. Automation in recruitment increases organization efficiency (Van Esch et al., 2019; Mehta et al., 2013) through better decision-making, leading to greater economic well-being (Jia et al., 2018). Artificial intelligence can minimize gender bias, with several algorithmic recommendations for screening résumés and evaluating talent/ competency, but which can sometimes also provide flawed results (Lauret, 2019). One example is IBM, which leveraged AI for customization of jobs based on an employee sentiment centric context (Guenole and Feinzig, 2019). Also, between 2018 and 2019, Amazon leveraged AI to measure employee performance, recruitment and decisions on job terminations (Lecher, 2019). Also, this study revealed that AI is used to identify the dark side of employees in the organization to measure performance based on their perceptions in emotional, mental, bias, manipulation, privacy and social burdens (Park et al., 2021). R2: How do HRM functions use the information for data-driven analytics(DDA) to make decisions through visualization? Several researchers have conducted studies on HRM functions using data-driven analytics in a qualitative approach (Falletta and Combs, 2020; Fernandez and Gallardo-Gallardo, 2020; McCartney and Fu, 2022). Due to the advancement of technology, data visualization is considered a storytelling tool (Knaflic, 2015). Thus HR practitioners leverage technology by explaining strategic stories using HR information to make decisions through data visualization where users can analyse and interpret data for meaningful insights (Kosara, 2016). Much of the software provides dashboards, which are able to display the multiple visualizations, embellishments and interactive sources of HR data (Caughlin and Bauer, 2019). SAP and Oracle companies have deployed HRIS software like Workday and OrangeHRM, with dashboard and visualization which allows users to personalize and interact with HR data through graphs (Caughlin and Bauer, 2019). The tools for data visualization are Tableau, Microsoft Power BI, R and Python (Caughlin and Bauer, 2019). Employee monitoring tools in the workforce to collect the reports of employees such as full-time employees and absenteeism ratios are represented in dashboards and scorecards (Angrave et al., 2016; Marler and Boudreau, 2017), which comprise historical data (Angrave et al., 2016) and survey information leading to benchmarking (Davenport and Harris, 2017). Yigitbasioglu and Velcu (2012) mentioned in
Embracing DDA in human resource management 203 their study on the evaluation of performance management dashboards in the HR domain that relatively little research has been conducted on data visualization to enhance business decisions. At this point, research on the use of a dashboard for interpretation and analysis is very limited. Hence, our study provides descriptive research and data analytics to bring significant research outcomes. Further, our research contributes to the transition to using data-driven analytics successfully in the HR profession.
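As a minimal sketch of the visualization step, using Python and matplotlib rather than the Tableau or Power BI dashboards named above, a single dashboard panel might be produced along the following lines; the departments, figures and target line are hypothetical.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical monthly absenteeism rates by department, as they might be extracted from an HRIS.
absenteeism = pd.DataFrame({
    "department": ["Sales", "Operations", "Finance", "IT", "HR"],
    "absence_rate_pct": [4.8, 6.2, 3.1, 2.7, 3.9],
})

fig, ax = plt.subplots(figsize=(6, 3.5))
ax.bar(absenteeism["department"], absenteeism["absence_rate_pct"], color="steelblue")
ax.axhline(4.0, color="grey", linestyle="--", label="Organisation-wide target (assumed)")
ax.set_ylabel("Absence rate (%)")
ax.set_title("Monthly absenteeism by department")
ax.legend()
fig.tight_layout()
fig.savefig("absenteeism_panel.png")  # one panel of a larger HR dashboard
```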
RESEARCH METHODOLOGY

Questionnaire Development
While constructing the questionnaire, we interviewed and consulted academics and practitioners and found that recruitment, performance management, training and payroll are the main HR functions for which decisions are made through HR analytics. Drawing on this expert input, we developed a systematic questionnaire using a five-point Likert scale and collected the data online.

Data Collection and Techniques
We received 104 responses; the data were collected in urban Bangalore using convenience sampling. Primary data were collected over four weeks in April 2021. Most of the respondents’ companies provide a dashboard with data visualization software, which helps HR professionals customize and interact with data through graphs. Tableau was used for data analysis and visualization; it can create a wide range of graphs and interactive dashboards that offer novel solutions to firms. Dashboards enhance reporting: they integrate near real-time analysis of the organization and its HR processes, increasing the value of HR data, and they help managers examine metrics at various levels in the organization (Carlson and Kavanagh, 2011). Our research examined the data through descriptive statistics. The ten factors proposed in the conceptual framework contribute to value creation in the organization.
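To show what descriptive analysis of such Likert-scale survey data can look like in code (independent of the Tableau dashboards actually used in this study), the following sketch tabulates hypothetical responses to a single questionnaire item.

```python
import pandas as pd

# Hypothetical responses (1 = strongly disagree ... 5 = strongly agree) to one questionnaire item,
# e.g. "My organisation uses HR analytics for recruitment decisions".
responses = pd.Series([5, 4, 4, 3, 5, 2, 4, 3, 4, 5, 1, 4, 3, 5, 4])

distribution = (responses.value_counts(normalize=True)
                         .sort_index()
                         .mul(100)
                         .round(1))

print(distribution)  # percentage of respondents per scale point
print("Agree or strongly agree:", f"{distribution.loc[4:5].sum():.1f}%")
```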
FINDINGS AND ANALYSIS

Table 13.2  Dashboard A

HRA Ontogeny: 54%
Analytics Sapience HR Professionals: 59%
HR Realm: (a) Talent Management 47%; (b) HR reports 24.4%
Metrics: (a) HR Metrics 26.92%; (b) Dashboards 3.85%
HR Analytics Experience: 11.54%
HRA Gauge: (a) Employee Competency 5.77%; (b) Employee Engagement 21.15%; (c) Organizational Performance 41.35%; (d) Knowledge 31.73%
The study results represented in Table 13.2 describe HRA Ontogeny in the organization: 54 per cent of respondents conveyed that they were aware of the rise of analytics for decision-making in the organization. Furthermore, on Sapience Analytics HR Professionals, 59 per cent of the respondents affirmed that employees understand that the core concepts and crux of data-driven analytics in people management are most important. In addition, HR Realm helps to implement analytics in talent management (47 per cent), HR reporting (24.4 per cent), HR value added metrics (26.92 per cent) and dashboards (3.85 per cent). Employees who have Experience in HR analytics of more than five years number 11.54 per cent in their organization. All these results conclude that HR analytics is still in its infancy in start-ups, and in mid-size and multinational companies. Lastly, HRA Gauge measured people skills based on employee competency (5.77 per cent), employee engagement (21.15 per cent), organizational performance (41.35 per cent) and knowledge (31.73 per cent). Table 13.3
Dashboard B
Embracing HRA Success: (a) Employee efficiency 31.73%; (b) Training 35.58%; (c) Workforce diversity 13.46%; (d) E-recruitment 19.23%
Employee Data Metrics: (a) Employee job type 66%; (b) Absenteeism 19%
HR Efficiency Road Map: (a) Employee value and performance 47%; (b) Quality of hire 24%; (c) Cost per hire 18%; (d) Employee turnover 8%
Big Data in HRA Automation: (a) Talent-driven future jobs 40%; (b) HR tasks customization 23%; (c) Employee emotional intelligence 16%; (d) Time management and discipline 15%; (e) Change champs/agents 10%
Automation in HR: (a) HR metrics 43%; (b) Employee data analysis 21%; (c) Predicting for HR decisions 14%; (d) Developing the HR scorecard 7%
The study results represented in Table 13.3 explain Embracing HRA Success to make strategic decisions through HR data by increasing employee efficiency (31.73 per cent), effective training (35.58 per cent), diversity in the workforce (13.46 per cent) and e-recruitment (19.23 per cent). Hence, considering data analysis of HRA success is still in an intermediate phase in the organization. Also, most of the time Employee Data Metrics gathered information through workforce statistics using employee job type (66 per cent) and absenteeism (19 per cent) to measure employee efficiency. Thus employee data metrics categorize the various jobs and measure to improve organization efficiency. In addition, HRA Efficiency Road Map helps to measure the employee value and performance (47 per cent), quality of hires (24 per cent), cost per hire (18 per cent) and turnover (8 per cent). Thus the analytics create a road map mainly concerned with employee engagement at the workplace. Also, with the help of big data, HRA Automation helps to create more talent-driven future jobs (40 per cent), developing customized human resource tasks (23 per cent), balances emotional intelligence of employees (16 per cent), enforcing time management and discipline in the organization (15 per cent) and helping people to be change champs/agents (10 per cent). Finally, HR Automation leads to organization success by developing HR metrics (43 per cent), employee data analysis (21 per cent), predicting HR decisions (14 per cent) and developing the HR scorecard (7 per cent).
DISCUSSION

The rise of people management analytics has been building as a competency for several decades. Workforce analytics provides measurements and metrics based on the emergence of big data to achieve company success, and real-life examples of workforce analytics in companies have caught people’s attention in a very short space of time. The concepts fit well with today’s streaming-data world, and it is becoming increasingly common to make data-driven analytics (DDA) decisions to achieve productivity. HR analytics is thus helping many firms to build strong positions on an evidence-based approach, now regarded as an important strategy in the social science domains (Madsen and Slatten, 2017). Our research provides an overview of HR analytics and how it can be profitable to organizations. It is a kind of storytelling with human resource data that helps the company to improve (Andersen, 2017). The study also describes the ten variables and how they support the organization in various ways: HR tools bring profit to the organization (Wandhe, 2020), and automation helps the organization to grow and to tackle the dark side of employee behaviour in the workplace (Park et al., 2021).
Dashboards built on data visualization facilitate employee awareness of HR analytics and its adoption in the organization for strategic growth. Hence, our findings add to the knowledge of HR analytics and support both previous and present studies.
THEORETICAL IMPLICATIONS

To our knowledge, the present research is the first to link Human capital theory (Becker, 2009) with the Human capital resource model (Ployhart et al., 2014). HR managers are able to understand the KSA (knowledge, skills, attitude) functions. Furthermore, managers collect data from various organizational divisions to conduct data analysis and make data-driven decisions. Thus, we can conclude that a more operational, technical and data-centric approach is required to achieve strategic outcomes from the numerous HR functions. The Theory of Planned Behavior helps to identify the factors required to implement HR analytics in the organization and to increase its adoption rate (Ajzen, 1991). The theory of planned behavior supports the empirical evidence obtained by using HRA tools for data analysis. Many organizations have also shown that employee data such as behaviours, perceptions, emotions, performance and attitudes can be captured and predicted with high accuracy; the theory supports this work by predicting employee behaviour to enrich the organization's capability (Ajzen, 1985). The LAMP framework (logic, analytics, measures and process) in human capital holds that these four components are essential for evidence-based data analysis, for establishing relationships and for enhancing decisions (Boudreau and Ramstad, 2007); the same authors state that these elements help to analyse the cause–effect relationship between human resource management and strategic business results. The HR scorecard model is a multiphase process that explains the importance of a scorecard in linking the HRM process to measured outcomes in companies (Becker et al., 2001). HCM:21 (Human Capital Management) is a logical framework used to collect, organize and interpret data for decision-making about future outcomes. This framework has four phases: scanning, planning, producing and predicting, and it is unique in people management in connecting the HR functions of hiring, salaries, training and talent management (Fitz-Enz, 2010). Furthermore, our study integrates evidence-based management theory (EBM) (Rousseau and Barends, 2011; Baba and Hakem Zadeh, 2012; Bezzina et al., 2017), the resource-based view (RBV) (Barney, 1991) and dynamic capability (Teece et al., 1997; Winter, 2003) to measure firm performance through HR analytics. All these contributions signify the importance of data-driven analytics in human resource management functions.
MANAGERIAL IMPLICATIONS

The rapid growth of HR analytics will challenge HR leaders to integrate analytics into their decision-making, and it requires substantial investment by organizations to evaluate employee performance. Data-driven analytics (DDA) decisions will be embraced by many CEOs who invest in their organizations to increase firm efficiency using HR metrics, dashboards and scorecards. Such analytics also requires HR managers to be data literate and to have the analytical skills needed to achieve strategic outcomes. Organizations need dedicated departments to educate individuals in analytical skills at grassroots level, from freshers to managers.
LIMITATIONS, FUTURE RESEARCH AND CONCLUSION

Our data was too limited to produce wide-ranging results, and respondents are still at an intermediate stage of understanding how to make strategic decisions using HR analytics. Future research could be carried out at the level of specific organizations, including comparative studies of HR analytics in the IT, e-commerce, banking, insurance and healthcare domains to capture the nuances of organizational performance. We suggest future directions for research in Table 13.4.
Table 13.4  Future directions of research

HR tools
  Research gap: 1. Lack of studies exploring the use of various HR tools to measure firms' performance. 2. Lack of competencies in the HR analytics department concerning the knowledge, skills and abilities (KSA) of team members.
  Proposed questions: a. Which are the HR tools that bring value creation? b. How can the KSA of HR analytics team members become more effective in bringing the best HR practices into firms?

HR Analytics
  Research gap: 1. Lack of studies on the data ethics and privacy implications of HR analytics in the Indian context.
  Proposed questions: a. How can the risks and challenges in the deployment of HR analytics in firms be minimized?

Data
  Research gap: 1. Lack of studies on data integration and sharing. 2. Fewer studies on HR analytics from a strategic business perspective.
  Proposed questions: a. How does data integration help HR analytics achieve competitive advantage?

Data Visualization
  Research gap: 1. Lack of studies on storytelling skills in firms.
  Proposed questions: a. What are the features and purposes of using a dashboard? b. How do dashboards bring value creation in recruitment, performance management and training?

Research Methodology
  Research gap: 1. Limited empirical studies. 2. The majority of studies are literature reviews with cases.
  Proposed questions: a. What are the factors that bring success to firms using HR analytics? b. How do HR analytics initiatives fail in mid-size companies?

HRM Functions
  Research gap: 1. Fewer studies found on talent management, absenteeism and data culture.
  Proposed questions: a. How can the best talent pool be brought into firms using HR analytics? b. How can absenteeism be minimized by using people analytics? c. What approach and methods are required to bring a data culture into HRM functions?

Bias
  Research gap: 1. Fewer studies are found that address HR analytics bias.
  Proposed questions: a. How are gender and racial biases tackled in firms using HR analytics?
Lastly, our study has examined the growth of HRA for data-driven analytics (DDA) decisions in the organization from a micro to a macro view. Human resource analytics is a promising and emerging field, and companies are adopting analytics for data-driven decision-making. The current era of big data and automation helps firms to tackle global competition. The field of HR analytics is thus a new concept in developing countries, one that offers several benefits to the organization and can contribute to achieving SDG 5, Gender Equality.
REFERENCES Ajzen, I. (1985). From intentions to actions: A theory of planned behavior. In J. Kuhl and J. Beckmann (eds), Action Control, Springer Verlag, pp. 11–39. Ajzen, I. (1991). The theory of planned behavior. Organizational Behavior and Human Decision Processes, 50(2), 179–211. Akter, S., Gunasekaran, A., Wamba, S.F., Babu, M.M. and Hani, U. (2020). Reshaping competitive advantages with analytics capabilities in service systems. Technological Forecasting and Social Change, 159, 120180. Andersen, M.K. (2017). Human capital analytics: The winding road. Journal of Organizational Effectiveness: People and Performance, 4(2), 133–6. Angrave, D., Charlwood, A., Kirkpatrick, I., Lawrence, M. and Stuart, M. (2016). HR and analytics: Why HR is set to fail the big data challenge. Human Resource Management Journal, 26(1), 1–11. Baba, V.V. and Hakem Zadeh, F. (2012). Toward a theory of evidence-based decision making. Management Decision, 50(5), 832–67. Baesens, B., De Winne, S. and Sels, L. (2017). Is your company ready for HR analytics?. MIT Sloan Management Review, 58(2), 20. Barney, J.B. (1991). The resource based view of strategy: Origins, implications, and prospects. Journal of Management, 17(1), 97–211. Bassi, L. (2011). Raging debates in HR analytics. People and Strategy, 34(2), 14. Becker, B.E., Huselid, M.A. and Ulrich, D. (2001). The HR Scorecard: Linking People, Strategy, and Performance. Boston, MA: Harvard Business Press. Becker, G.S. (2009). Human Capital: A Theoretical and Empirical Analysis, with Special Reference to Education. Chicago, IL: University of Chicago Press. Ben-Gal, H.C. (2019). An ROI-based review of HR analytics: Practical implementation tools. Personnel Review, 48(6), 1429–48. Bezzina, F., Cassar, V., Tracz-Krupa, K., Przytuła, S. and Tipurić, D. (2017). Evidence-based human resource management practices in three EU developing member states: Can managers tell truth from fallacy?. European Management Journal, 35(5), 688–700. Boakye, A. and Lamptey, Y.A. (2020). The rise of HR Analytics: Exploring its implications from a developing country perspective. Journal of Human Resource Management, 8(3), 181–9. Bontis, N. and Fitz‐Enz, J. (2002). Intellectual capital ROI: A causal map of human capital antecedents and consequents. Journal of Intellectual Capital, 3(3), 223–47. Boudreau, J. and Cascio, W. (2017). Human capital analytics: Why are we not there? Journal of Organizational Effectiveness: People and Performance, 4(2), 119–26. Boudreau, J.W. and Ramstad, P.M. (2007). Beyond HR: The New Science of Human Capital. Harvard Business Press. Carlson, K.D. and Kavanagh, M.J. (2011). HR metrics and workforce analytics. In R.D. Johnson, K.D. Carlson and M.J. Kavanagh (eds), Human Resource Information Systems: Basics, Applications, and Future Directions. Thousand Oaks, CA: Sage Publications, pp. 150–87. Caughlin, D.E. and Bauer, T.N. (2019). Data visualizations and human resource management: The state of science and practice. Research in Personnel and Human Resources Management, 37, 89–132. Corritore, M., Goldberg, A. and Srivastava, S.B. (2020). The new analytics of culture. Harvard Business Review, 98(1), 76–83. Davenport, T. and Harris, J. (2017). Competing on Analytics: The New Science of Winning. Boston, MA: Harvard Business Review Press. Davenport, T.H., Harris, J. and Shapiro, J. (2010). Competing on talent analytics. Harvard Business Review, 88(10), 52–8. Delbridge, R. and Barton, H. (2002). 
Organizing for continuous improvement: Structures and roles in automotive components plants. International Journal of Operations & Production Management, 22(6), 680–92. DeSanctis, G. (1986). Human resource information systems: A current assessment, MIS Quarterly, 15–27. Edvinsson, L. (1997). Developing intellectual capital at Skandia. Long Range Planning, 30(3), 366–73.
Embracing DDA in human resource management 209 Edwards, M.R. (2018). HR metrics and analytics. e-HRM: Digital Approaches, Directions & Applications, pp. 89–105. Fabbri, T., Scapolan, A.C., Bertolotti, F. and Canali, C. (2019). HR analytics in the digital workplace: Exploring the relationship between attitudes and tracked work behaviors. In HRM 4.0 For Human-Centered Organizations, 23, 161–75. Falletta, S. (2014). In search of HR intelligence: Evidence-based HR analytics practices in high performing companies. People and Strategy, 36(4), 28. Falletta, S.V. and Combs, W.L. (2020). The HR analytics cycle: A seven-step process for building evidence-based and ethical HR analytics capabilities, Journal of Work-Applied Management, 13(1), 51–68. Fernandez, J. (2019). The ball of wax we call HR analytics. Strategic HR Review, 80(1), 21–25. Fernandez, V. and Gallardo-Gallardo, E. (2020). Tackling the HR digitalization challenge: Key factors and barriers to HR analytics adoption. Competitiveness Review: An International Business Journal, 31(1), 162–87. Fink, A.A. (2010). New trends in human capital research and analytics. People and Strategy, 33(2), 14. Fitz-Enz, J. (2010). The New HR Analytics: Predicting the Economic Value of your Company’s Human Capital Investments. Amacom. Fitz-Enz, J. and Mattox, J.R. (2014). Predictive Analytics for Human Resources. John Wiley & Sons. Gittell, J.H. and Ali, H.N. (2021). Relational Analytics: Guidelines for Analysis and Action. New York: Routledge. Greasley, K. and Thomas, P. (2020). HR analytics: The onto‐epistemology and politics of metricised HRM. Human Resource Management Journal, 30(4), 494–507. Griffin, R. and Moorhead, G. (2011). Organizational Behavior. Cengage Learning. Guenole, N. and Feinzig, S. (2018). The Business Case for AI in HR. With Insights and Tips on Getting Started. Armonk: IBM Smarter Workforce Institute, IBM Corporation. Guzzo, R.A., Fink, A.A., King, E., Tonidandel, S. and Landis, R.S. (2015). Big data recommendations for industrial-organizational psychology. Industrial and Organizational Psychology, 8(4), 491. Haines, V.Y. and Petit, A. (1997). Conditions for successful human resource information systems. Human Resource Management, 36(2), 261–75. Hannon, J., Jelf, G. and Brandes, D. (1996). Human resource information systems: Operational issues and strategic considerations in a global environment. International Journal of Human Resource Management, 7(1), 245–69. Harris, J.G., Craig, E. and Light, D.A. (2011). Talent and analytics: New approaches, higher ROI. Journal of Business Strategy, 32(6), 4–13. Hilbert, M. and Lopez, P. (2011). The world’s technological capacity to store, communicate, and compute information. Science, 332(6025), 60–65. Hota, J. and Ghosh, D. (2013). Workforce analytics approach: An emerging trend of workforce management. AIMS International Journal, 7(3), 167–79. Huselid, M.A. (2018). The science and practice of workforce analytics: Introduction to the HRM special issue. Human Resource Management, 57(3), 679–84. Huselid, M.A., Becker, B.E. and Beatty, R.W. (2005). The Workforce Scorecard: Managing Human Capital to Execute Strategy. Harvard Business Press. Jia, Q., Guo, Y., Li, R., Li, Y. and Chen, Y. (2018). A conceptual artificial intelligence application framework in human resource management. In Proceedings of the international conference on electronic business, June, pp. 106–14. Kane, G.C., Palmer, D., Phillips, A.N., Kiron, D. and Buckley, N. (2015). Strategy, not technology, drives digital transformation. 
MIT Sloan Management Review. Kapoor, B. and Sherif, J. (2012). Human resources in an enriched environment of business intelligence. Kybernetes, 41(10), 1625–37. King, K.G. (2016). Data analytics in human resources: A case study and critical review. Human Resource Development Review, 15(4), 487–95. Knaflic, C.N. (2015). Storytelling with Data: A Data Visualization Guide for Business Professionals. John Wiley & Sons.
210 Handbook of big data research methods Kosara, R. (2016). An empire built on sand: Reexamining what we think we know about visualization. In Proceedings of the Sixth Workshop on Beyond Time and Errors on Novel Evaluation Methods for Visualization, pp. 162–8. Kossek, E.E., Young, W., Gash, D.C. and Nichol, V. (1994). Waiting for innovation in the human resources department: Godot implements a human resource information system. Human Resource Management, 33(1), 135–59. Kryscynski, D., Reeves, C., Stice‐Lusvardi, R., Ulrich, M. and Russell, G. (2018). Analytical abilities and the performance of HR professionals. Human Resource Management, 57(3), 715–38. Lauret, J. (2019), Amazon’s sexist AI recruiting tool: How did it go so wrong? Accessed 27 September 2021 at https://becominghuman.ai/amazons-sexist-ai-recruiting-tool-how-did-it-go-so-wrong -e3d14816d98e. Lawler III, E.E., Levenson, A. and Boudreau, J.W. (2004). HR metrics and analytics: Uses and impacts. Human Resource Planning Journal, 27(4), 27–35. Lecher, C. (2019). How Amazon automatically tracks and fires warehouse workers for ‘productivity’. The Verge, 25. Lederer, A.L. (1984). Planning and developing a human resource information system. The logic of a step-by-step approach. The Personnel Administrator, 29(8), 27–39. Levenson, A. (2005). Harnessing the power of HR analytics. Strategic HR Review, 4(3), 28–31. Levenson, A. (2011). Using targeted analytics to improve talent decisions. People and Strategy, 34(2), 34. Levenson, A. (2018). Using workforce analytics to improve strategy execution. Human Resource Management, 57(3), 685–700. Levenson, A. and Fink, A. (2017). Human capital analytics: Too much data and analysis, not enough models and business insights. Journal of Organizational Effectiveness: People and Performance, 4(2), 145–56. Levenson, A., Lawler III, E.E. and Boudreau, J.W. (2005). Survey on HR Analytics and HR transformation: Feedback report. Center for Effective Organizations, University of Southern California, 131–42. Madsen, D.O. and Slatten, K. (2017). The rise of HR analytics: A preliminary exploration. Global Conference on Business and Finance Proceedings, 12(1), 148–59. Magnus, M. and Grossman, M. (1985). Computers and the personnel department. The Personnel Journal, 64(4), 42–8. Marler, J.H. and Boudreau, J.W. (2017). An evidence-based review of HR Analytics. The International Journal of Human Resource Management, 28(1), 3–26. Mathieson, K. (1993). Variations in users’ definitions of an information system. Information & management, 24(4), 227–34. Mathys, N. and LaVan, H. (1982). A survey of the human resource information systems (HRIS) of major companies. Human Resource Planning, 5(2), 83–90. McCartney, S. and Fu, N. (2022). Promise versus reality: A systematic review of the ongoing debates in people analytics. Journal of Organizational Effectiveness: People and Performance. McCartney, S., Murphy, C. and McCarthy, J. (2020). 21st century HR: A competency model for the emerging role of HR Analysts, Personnel Review, 50(6), 1495–513. McIver, D., Lengnick-Hall, M.L. and Lengnick-Hall, C.A. (2018). A strategic approach to workforce analytics: Integrating science and agility. Business Horizons, 61(3), 397–407. Mehta, S., Pimplikar, R., Singh, A., Varshney, L.R. and Visweswariah, K. (2013). Efficient multifaceted screening of job applicants. In Proceedings of the 16th International Conference on Extending Database Technology, March, pp. 661–71. Minbaeva, D.B. (2018). 
Building credible human capital analytics for organizational competitive advantage. Human Resource Management, 57(3), 701–13. Mohammed, D. and Quddus, A. (2019). HR analytics: A modern tool in HR for predictive decision making. Journal of Management, 6(3). Mondore, S., Douthitt, S. and Carson, M. (2011). Maximizing the impact and effectiveness of HR analytics to drive business outcomes. People and Strategy, 34(2), 20. Nagpal, T. and Mishra, M. (2021). Analyzing human resource practices for decision making in banking sector using HR analytics. Materials Today: Proceedings.
Embracing DDA in human resource management 211 Nielsen, C. and McCullough, N. (2018). How people analytics can help you change process culture and strategy. Harvard Business Review, 17. Oracle and Accenture (2014). The future of HR technologies. Accenture. Accessed 28 August 2021 at https://www.accenture.com/_acnmedia/accenture/conversion-assets/dotcom/documents/global/pdf/ digital_1/accenture-oracle-hcm-ebook-future-of-hr-five-technology-imperatives.pdf. Park, H., Ahn, D., Hosanagar, K. and Lee, J. (2021). Human–AI interaction in human resource management: Understanding why employees resist algorithmic evaluation at workplaces and how to mitigate burdens. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, May, pp. 1–15. Patre, S. (2016). Six thinking hats approach to HR analytics. South Asian Journal of Human Resources Management, 3(2), 191–9. Pfeffer, J. (1994). Competitive Advantage Through People. Boston, MA: Harvard Business School Press. Pfeffer, J. and Sutton, R.I. (2006). Management half-truths and nonsense: How to practice evidence-based management. California Management Review, 48(3), 77–100. Platanou, K. and Makela, K. (2016). HR function at the crossroads of digital disruption. Tyron, 1, 19–26. Ployhart, R.E., Nyberg, A.J., Reilly, G. and Maltarich, M.A. (2014). Human capital is dead: long live human capital resources! Journal of Management, 40(2), 371–98. Qureshi, T.M. (2020). HR analytics, fad or fashion for organizational sustainability. Sustainable Development and Social Responsibility, 1, 103–107. Rangaiah, M. (2020). What is HR analytics? Role, challenges and application. Analytics Steps, 23 January. Accessed 27 August 2021 at https://www.analyticssteps.com/blogs/what-hr-analytics-role -challenges-and-applications. Rasmussen, T. and Ulrich, D. (2015). Learning from practice: How HR analytics avoids being a management fad. Organizational Dynamics, 44(3), 236–42. Rousseau, D.M. and Barends, E.G.R. (2011). Becoming an evidence-based HR practitioner. Human Resource Management Journal, 21(3), 221–35. Ryan, J.C. (2020). Retaining, resigning and firing: Bibliometrics as a people analytics tool for examining research performance outcomes and faculty turnover. Personnel Review, 50(5), 1316–35. Rynes, S.L., Giluk, T.L. and Brown, K.G. (2007). The very separate worlds of academic and practitioner periodicals in human resource management: Implications for evidence-based management. Academy of Management Journal, 50(5), 987–1008. Shah, N., Irani, Z. and Sharif, A.M. (2017). Big data in an HR context: Exploring organizational change readiness, employee attitudes and behaviors. Journal of Business Research, 70, 366–78. Sharma, A. and Sharma, T. (2017). HR analytics and performance appraisal system: A conceptual framework for employee performance improvement. Management Research Review, 40(6), 684–97. Shrivastava, S., Nagdev, K. and Rajesh, A. (2018). Redefining HR using people analytics: The case of Google. Human Resource Management International Digest, 26(2), 3–6. SHRM Foundation (2016). Use of workforce analytics for competitive advantage. Accessed 22 August 2021 at https://www.shrm.org/foundation/ourwork/initiatives/preparing-for-future-hr-trends/ Documents/Workforce%20Analytics%20Report.pdf. Srivastava, R., Palshikar, G.K. and Pawar, S. (2015). Analytics for improving talent acquisition processes. In International Conference on Advanced Data Analysis, Business Analytics and Intelligence, ICADABAI. Stone, D.L., Deadrick, D.L., Lukaszewski, K.M. 
and Johnson, R. (2015). The influence of technology on the future of human resource management. Human Resource Management Review, 25(2), 216–31. Strohmeier, S. and Piazza, F. (2015). Prozesse der Human Resource Intelligence und Analytics. Human Resource Intelligence und Analytics, pp. 49–87. Talent Management and HR (2014). How Google is using people analytics to completely reinvent HR. Accessed 22 August 2021 at https://www.tlnt.com/how-google-is-using-people-analytics-to -completely-reinvent-hr-2/. Taylor, G.S. and Davis, J.S. (1989). Individual privacy and computer-based human resource information systems. Journal of Business Ethics, 8(7), 569–76. Teece, D.J., Pisano, G. and Shuen, A. (1997). Dynamic capabilities and strategic management. Strategic Management Journal, 18(7), 509–33.
212 Handbook of big data research methods Tursunbayeva, A., Di Lauro, S. and Pagliari, C. (2018). People analytics: A scoping review of conceptual boundaries and value propositions. International Journal of Information Management, 43, 224–47. Ulrich, D. (1996). Human Resource Champions: The Next Agenda for Adding Value and Delivering Results. Harvard Business Press. Ulrich, D. and Beatty, D. (2001). From partners to players: Extending the HR playing field. Human Resource Management, 40(4), 293–307. Ulrich, D. and Dulebohn, J.H. (2015). Are we there yet? What’s next for HR?. Human Resource Management Review, 25(2), 188–204. Van den Heuvel, S. and Bondarouk, T. (2017). The rise (and fall?) of HR analytics. Journal of Organizational Effectiveness: People and Performance, 4(2), 157–78. Van der Togt, J. and Rasmussen, T.H. (2017). Toward evidence-based HR. Organizational Effectiveness: People and Performance, 4(2), 127–32. Van Esch, P., Black, J.S. and Ferolie, J. (2019). Marketing AI recruitment: The next phase in job application and selection. Computers in Human Behavior, 90, 215–22. Waber, B. (2013). People Analytics: How Social Sensing Technology Will Transform Business and What it Tells Us about the Future of Work. Upper Saddle River, NJ: FT Press. Walsh, B. and Volini, E. (2017). Rewriting the rules for the digital age: 2017 Deloitte global human capital trends. Accessed 26 January 2023 at https://www2.deloitte.com/content/dam/Deloitte/global/ Documents/HumanCapital/hc-2017-global-human-capital-trends-gx.pdf. Wandhe, P. (2020). HR Analytics: A tool for strategic approach to HR productivity. Accessed at SSRN 3700502. Welbourne, T.M. (2015). Data‐driven storytelling: The missing link in HR data analytics. Employment Relations Today, 41(4), 27–33. Winter, S.G. (2003). Understanding dynamic capabilities. Strategic Management Journal, 24(10), 991–5. Wright, P.M., Dunford, B.B. and Snell, S.A. (2001). Human resources and the resource based view of the firm. Journal of Management, 27(6), 701–721. Yigitbasioglu, O.M. and Velcu, O. (2012). A review of dashboards in performance management: Implications for design and research. International Journal of Accounting Information Systems, 13(1), 41–59.
APPENDIX

Questionnaire
1. How knowledgeable are you in HR analytics?
2. To what extent do you understand HR analytics as implemented in your organization to achieve sustainability?
3. Which areas of HR analytics have been put into practice in your organization?
4. In HR analytics, how do we measure a person's (individual) abilities or character traits?
5. How long have you been working with HR analytics?
6. In the organization, how could HR analytics be implemented for a long-term strategy?
7. Is employee data collected by the HR department in an organization through workforce statistics?
8. How does HR analytics measure the firm's efficiency and effectiveness?
9. Select the one option for application of HR analytics leading to firms' success: a. HR metrics; b. Employee data analysis; c. Predicting for HR decisions; d. Developing the HR scorecard.
10. Choose one parameter of big data and automation of HR jobs that will help in the following areas: a. Talent driven future jobs; b. HR tasks customization; c. Employee emotional intelligence; d. Time management and discipline.
14. A process framework for big data research: social network analysis using design science

Denis Dennehy, Samrat Gupta and John Oredo
1. INTRODUCTION There has been contagious enthusiasm from academics and practitioners surrounding the notion of ‘big data’ and how it will revolutionize decision-making (Modgil et al., 2021; Choi et al., 2018; Fosso Wamba et al., 2017). To facilitate data-driven decision making, organizations need to invest in big data initiatives (Grover et al., 2018) to develop efficient and effective processes that will translate big data into meaningful insights (Davenport, 2018). At the same time, concerns are being raised that investing in big data initiatives does not necessarily lead to more effective decision making (Hirschheim, 2021; Ghasemaghaei et al., 2018; Dennehy et al., 2021). Furthermore, decision-making ‘is not just an act of decision-making between a given set of parameters, but it is also about the continuous act of shaping and designing of organizations and their stakeholders’ experiences’ (Avital and Te’Eni, 2009, p. 154). Big data can be characterized in terms of seven Vs, comprising volume, variety, velocity, veracity, value, variability, and visualization, which present various challenges in data management (Mikalef et al., 2017; Seddon and Currie, 2017; Gandomi and Haider, 2015). There are two key processes for extracting insight from big data: data management and big data analytics (BDA) (Gandomi and Haider, 2015). Data management refers to the processes and technologies required to collect, store and prepare data for analysis, while BDA refers to the entire process of managing, processing and analyzing the data characteristics (e.g., the Vs) to create actionable insights to deliver sustained business value, measure performance and achieve competitive advantage (Fosso Wamba et al., 2015; Watson, 2014). BDA can be categorized into three types (descriptive analytics, predictive analytics and prescriptive analytics), which have implications for the technologies and architectures used for BDA (Watson, 2014). Developing ‘big data analytics capabilities’ is an emerging technological capability to deploy an organization’s data, technology and talent effectively through firm‐wide processes, roles and structures (Mikalef et al., 2019). To date, research has largely focused on the technical aspects of big data (Mikalef et al., 2017) and its application in specific contexts (e.g., marketing, healthcare, smart cities, supply chains), but with limited attention given to the underlying process (Grover and Kar, 2017). This knowledge deficit is concerning as it is critically important to understand the processes required to leverage big data and create business value through data-driven decisions (Mikalef et al., 2017). Furthermore, processes are important because they enable organizations to standardize employee work activities, enhance their process execution, as well as benefit from process standardization (Vom Brocke and Rosemann, 2015; Schäfer et al., 2013). This is particularly important for novice users (e.g., students, graduates) who may have limited knowledge or expertise about an organisational process and therefore require support in their process execution (Moreno et al., 2020). 214
A process framework for big data research 215 The context of this study is the credit network banks and organizations in India. We apply social network analysis whereby social phenomena are represented and studied by data on overlapping dyads as the units of observation (Brandes et al., 2013). Social network analysis consists of a series of mathematical techniques that, using network and graph theories, can be used to understand the structure and the dynamics of complex networks (Pallavicini et al., 2017). A complex network is a system for which it is difficult to reduce the number of parameters without losing its essential global functional properties (Costa et al., 2007). Numerous tools have been developed to fulfil the task of analyzing and describing complex social networks (Kim and Hastak, 2018; Valeri and Baggio, 2021). Social network analysis has been used in a range of contexts including disaster management (Kim and Hastak, 2018), tourism management (Valeri and Baggio, 2021), conspiracy theories about Covid-19 and 5G (Ahmed et al., 2020), disease ecology (Albery et al., 2021), migration and transnationalism (Bilecen et al., 2018), and online collaborative learning (Saqr et al., 2018). Despite the large body of literature addressing the topic of social network analysis, there is a noticeable absence of process frameworks that can guide novice researchers and practitioners. We address this gap in knowledge by proposing a design-based process framework to guide novice and experienced researchers and practitioners in the use of big data. We ground our framework on two streams of literature, namely social network analysis and design science research (DSR). In our research project we follow the DSR approach and address the following research aim: R1: To design a process framework for the effective application of social network analysis in big data research projects. Design science research (DSR) is a problem-solving paradigm that seeks to ‘design and evaluate’ innovative artifacts (e.g., concepts, models, methods, and instantiations) with the desire to improve an environment, by introducing the artifact and associated processes for creating it (Holmström et al., 2009; March and Smith, 1995; Hevner et al., 2004). While several process models have been proposed for DSR projects (e.g., Nunamaker et al., 1991; Walls et al., 1992; Hevner, 2007; Kuechler and Vaishnavi, 2008) we adopt the model proposed by Peffers et al. (2007) as it is the mostly widely cited DSR model (Vom Brocke et al., 2020) and although it is presented in a nominally sequential order, it is iterative in practice (Peffers et al., 2007). DSR is about understanding and improving the search among potential components to construct an artifact that is intended to solve a real-world problem (Baskerville, 2008). Essentially, DSR addresses ‘wicked problems’, or using Simon’s (1973) terminology, ‘ill-structured’ (Brooks Jr, 2010; Rittel and Webber, 1974), which are ‘decision situations where decision-makers may not know or agree on the goals of the decision, and even if the goals are known, the means by which these goals are achieved are not known and requisite solution designs to solve the problem may not even exist’ (Holmström et al., 2009, p. 67). The remainder of this chapter is structured as follows. First, a synthesis of key literature related to social network analysis and the principles and structure of complex networks is presented. Next, justification for adopting a design science research methodology is provided. 
Then, a rich context of the financial credit networks of Indian banks and organizations is provided, followed by a discussion about the proposed process framework and implications for research and practice. The chapter ends with a conclusion.
2. REVIEW OF SOCIAL NETWORK THEORY
A social network is a collection of actors (nodes), including people and organizations, linked by a collection of social relations (Laumann et al., 1978). It is widely employed in the social sciences, behavioral sciences, political science, economics, organizational science and industrial engineering (Garton et al., 1999). The fundamental components of a social network study are the actor (node) and the connection (link). The nodes can be individuals, corporates, groups or other social units, and the nodes are linked to each other by ties (Wasserman and Faust, 1994). Just as a geographical map describes the landscape, networks offer a tantalizing tool to model the complex systems existing in the real world. Network-theoretic modeling and visualization help in managing and apprehending the enormity of complex systems. Networks help in understanding the basic patterns of interactions within the components and thus aid in understanding the complexity in real-world systems (Boccaletti et al., 2006). For instance, the banking systems of several countries, including India, Peru, Italy, Mexico and the US, have captured accurate repositories of their interbank networks for systemic risk analysis (Bargigli et al., 2015; Cuba et al., 2021; Gupta and Kumar, 2021; Soramäki et al., 2007). In these networks (generally referred to as complex networks), the connection patterns between the nodes are neither purely regular nor random – they are complex (Fortunato, 2010; Soramäki et al., 2007). The goal of modeling and analyzing complex networks is to reproduce the observed collective behavior in the real world by simplifying the rules of interaction between the components constrained in the network.
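To make the actor-and-tie vocabulary concrete, the short sketch below builds a toy social network with the Python library NetworkX and inspects its basic components. The actors and ties are invented for illustration and are not drawn from the chapter's case study.

```python
import networkx as nx

# A toy social network: actors (nodes) joined by ties (links).
G = nx.Graph()
G.add_edges_from([
    ("Asha", "Bilal"),        # friendship tie
    ("Asha", "Chen"),
    ("Bilal", "Chen"),
    ("Deepa", "Chen"),
    ("Deepa", "StateBankX"),  # a tie between a person and an organization
])

print(G.number_of_nodes(), "actors and", G.number_of_edges(), "ties")
print("Degree of each actor:", dict(G.degree()))
print("Neighbors of Chen:", list(G.neighbors("Chen")))
```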
2.1  Properties and Structure of Complex Networks
Complex networks and their implications on dynamical processes form a broad area of study. Some of the most important research areas in complex networks are related to models of networks, structural properties of networks, module discovery in networks, motif discovery in networks, link prediction in networks and visual representation of networks. The properties that characterize the structural aspects of complex networks are the presence of a giant component, the small world effect, scale-freeness, a high clustering coefficient and the presence of modular structure, which are discussed next.

Presence of giant component: real-world complex networks either contain a giant component or they are fully connected (Barabási, 2014). A giant component containing a finite fraction of all the nodes emerges if the average degree, represented by ⟨d⟩, is greater than 1 (Barabási, 2014). However, all the nodes of a network are absorbed by the giant component if the average degree ⟨d⟩ is greater than ln N, where N is the number of nodes in the network (Barabási, 2014). Though many real-world networks such as the internet and the power grid do not satisfy the criteria of being fully connected (Barabási, 2014; Pagani and Aiello, 2013), the social network of humans in the world, with a population of around 7.5 billion, satisfies the criteria of being fully connected, as the average degree ⟨d⟩ ≈ 1000 is greater than ln(7.5 × 10^9) ≈ 22.73 (Barabási, 2014).

Presence of small world effect: complex networks are characterized by the small world phenomenon, implying that any two randomly selected nodes in a network are connected within short distances or hops (Watts and Strogatz, 1998). In practical terms, the small world effect has been manifested as 'six degrees of separation', meaning that between any two individuals, even on opposite sides of the globe, there exists a path of at most six acquaintances (Travers and Milgram, 1969). Considering a network with average degree ⟨d⟩ and N nodes, the small world effect can be explained by the following calculation. Any node in this network is on average connected to:

⟨d⟩ nodes within 1 hop,
⟨d⟩^2 nodes within 2 hops,
⟨d⟩^3 nodes within 3 hops,
…,
⟨d⟩^h nodes within h hops.

Precisely, the expected number of nodes up to distance h from a starting node can be formulated as:

\[ E(h) = 1 + \langle d \rangle + \langle d \rangle^{2} + \langle d \rangle^{3} + \dots + \langle d \rangle^{h} = \frac{\langle d \rangle^{h+1} - 1}{\langle d \rangle - 1} \approx \langle d \rangle^{h} \tag{14.1} \]

Assuming that the maximum number of hops, or diameter, of the network is h_max, and given that the total number of nodes in the network is N, this can be mathematically expressed as:

\[ E(h_{max}) \approx N \tag{14.2} \]

\[ \langle d \rangle^{h_{max}} \approx N \tag{14.3} \]

\[ h_{max} \approx \frac{\ln N}{\ln \langle d \rangle} \tag{14.4} \]

Thus, equation (14.4) represents the mathematical formulation of the small world effect and also offers a good approximation for the average path length ⟨h⟩ between any two nodes in a complex network (Barabási, 2014). Since ln N ≪ N, the dependence of the diameter h_max or average path length ⟨h⟩ on ln N implies that distances in real networks are much smaller than the size of the system. Moreover, the denominator ln ⟨d⟩ implies that the denser the network, the smaller the average distance between the nodes of the network.

Scale-free property of complex networks: the degree distribution of nodes in random networks is Poisson, such that the degree of each node is typically given by d ≈ ⟨d⟩ (Barabási and Albert, 1999). See Figure 14.1(a), which illustrates the Poisson degree distribution in random networks.
On the other hand, complex networks have a statistically significant probability of each node having a much higher degree than the average degree ⟨d⟩ (Barabási and Albert, 1999). Therefore, complex networks are free of a characteristic scale and are called scale-free networks (Barabási and Albert, 1999). The degree distribution of nodes in these networks is given by the power law:

\[ P(d) \propto d^{-\gamma} \tag{14.5} \]

where the value of γ lies approximately between 2 and 3, which implies that there are few hubs in a network that are highly connected and dominate the topology of the network (Dorogovtsev and Mendes, 2002). See Figure 14.1(b), which illustrates the power law degree distribution in complex networks. Power law degree distributions have been observed in several real-world networks such as the internet, the world-wide-web, international trade networks and citation networks (Rosvall, 2006). In a non-network context, power laws have been observed in the rank of word frequencies, the size of cities and the distribution of incomes (Zipf, 1949).

Clustering coefficient: measures the extent to which nodes in a network tend to cluster together (Boccaletti et al., 2006). Intuitively, the clustering coefficient represents the probability that two connections of a person relate to each other in a social network. It has two versions: the first based on the global aspect of clustering in the network and the second based on a local (node-wise) indication of clustering (Boccaletti et al., 2006).
Figure 14.1  (a) Poisson degree distribution in random networks; (b) power law degree distribution in complex networks
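Because the figure itself cannot be reproduced here, the sketch below shows one way such degree distributions can be generated and plotted: an Erdős–Rényi random graph (approximately Poisson) against a Barabási–Albert preferential-attachment graph (approximately power law). The parameters (10,000 nodes, average degree of about 8) are arbitrary choices for illustration, not values from the chapter.

```python
import collections
import networkx as nx
import matplotlib.pyplot as plt

N, avg_deg = 10_000, 8

# Random (Erdos-Renyi) graph: degrees concentrate around the mean (Poisson-like).
G_random = nx.gnp_random_graph(N, avg_deg / (N - 1), seed=42)
# Scale-free (Barabasi-Albert) graph: heavy-tailed, power-law-like degrees.
G_scalefree = nx.barabasi_albert_graph(N, avg_deg // 2, seed=42)

def degree_distribution(G):
    """Return the observed degrees and their empirical probabilities P(d)."""
    counts = collections.Counter(d for _, d in G.degree())
    degrees = sorted(counts)
    return degrees, [counts[d] / G.number_of_nodes() for d in degrees]

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, (G, title) in zip(axes, [(G_random, "Random network"),
                                 (G_scalefree, "Scale-free network")]):
    x, y = degree_distribution(G)
    ax.loglog(x, y, "o", markersize=3)
    ax.set(title=title, xlabel="degree d", ylabel="P(d)")
plt.tight_layout()
plt.show()
```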
The global clustering coefficient is the measure of the probability that two adjacent neighbors of a node are also adjacent to each other (Newman, 2003). This relation leads to the formation of triangles within the network. Thus, the higher the number of triangles within a network, the higher the clustering coefficient. Mathematically, it can be represented as:

\[ GC = \frac{3 N_{\Delta}}{N_{\Lambda}} \tag{14.6} \]

where N_Δ denotes the number of triangles, wherein each of the three nodes is connected to the remaining two nodes, and N_Λ denotes the number of connected triplets, wherein at least one node is connected to the other two. The multiplication factor of three indicates that each triangle forms three connected triplets, and the value of GC lies between 0 and 1 (Newman, 2003). Unlike GC, which depends on the global properties of the network, the local clustering coefficient LC_i of a node n_i is given by the ratio of the links existing between the nodes in the adjacent neighborhood of n_i to the number of possible links between them (Watts and Strogatz, 1998). Mathematically, this notion can be formulated as:

\[ LC_i = \frac{2\,\bigl|\{\, l_{jk} : n_j, n_k \in N_i,\; l_{jk} \in L \,\}\bigr|}{d_i (d_i - 1)} \tag{14.7} \]

where N_i denotes the neighborhood subset of node n_i and d_i is the degree of node n_i. The local clustering coefficient LC of the network is then determined by taking the average of the local clustering coefficients of all the nodes, as given by the following formula:

\[ LC = \frac{1}{N} \sum_{i=1}^{N} LC_i \tag{14.8} \]
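As a quick numerical illustration of equations (14.4) and (14.6)–(14.8), the sketch below computes the clustering coefficients with NetworkX on a small random graph and compares the observed average path length with the ln N / ln⟨d⟩ approximation. The graph and its parameters are arbitrary and purely illustrative.

```python
import math
import networkx as nx

# Illustrative random graph with N nodes and average degree around 10.
N, p = 2_000, 0.005
G = nx.gnp_random_graph(N, p, seed=7)
G = G.subgraph(max(nx.connected_components(G), key=len)).copy()  # keep the giant component

avg_deg = sum(d for _, d in G.degree()) / G.number_of_nodes()

# Equation (14.6): global clustering coefficient = 3 * triangles / connected triplets.
gc = nx.transitivity(G)

# Equations (14.7)-(14.8): local clustering per node, then the network-level average.
lc_per_node = nx.clustering(G)                     # dict {node: LC_i}
lc = sum(lc_per_node.values()) / len(lc_per_node)  # same value as nx.average_clustering(G)

# Equation (14.4): small-world approximation of the average path length.
approx_h = math.log(G.number_of_nodes()) / math.log(avg_deg)
observed_h = nx.average_shortest_path_length(G)

print(f"<d> = {avg_deg:.2f}, GC = {gc:.4f}, LC = {lc:.4f}")
print(f"ln N / ln <d> = {approx_h:.2f}, observed average path length = {observed_h:.2f}")
```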
Although there are several other measures to reveal the structure of real-world networks, a network can be considered complex if the number of links present in the network is far less than the total possible number of links within it, the average node degree is greater than 1 and the power law exponent is greater than 2 (Barabási, 2014). Moreover, the clustering coefficient of the complex network should be much higher than that of the corresponding random network, and the average path length should be reasonably close to that of the corresponding random network (Albert and Barabási, 2002).

Modular structure: a high clustering coefficient gives an indication about the network topology and the presence of clusters of nodes in a network. These clusters of nodes have been referred to as cohesive subgroups, modules or complexes, depending on the context and research discipline. In today's digital world, wherein networks are increasingly being mapped, modules can be viewed as groups of humans, places, banks, photos, events, web pages or any other real-world entity. In unipartite social networks, the clustering of nodes has been studied theoretically as homophily, one of the underlying tenets of social network theory (McPherson et al., 2001). Homophily is the tendency of individuals to mingle with other individuals of their own kind. This tendency may be induced by preference, such as gender and ethnicity, or by constraints, such as organization and educational standards. 'Modularity' is a network-level measure to determine the degree of homophily or the goodness of modular structure in a complex network (Newman, 2006). Mathematically, modularity can be expressed by the following formula (Clauset et al., 2004):

\[ Q = \frac{1}{2L} \sum_{vw} \left[ A_{vw} - \frac{d_v d_w}{2L} \right] \delta(C_v, C_w) \tag{14.9} \]

where L is the total number of edges in the network, d_v is the degree of node n_v, d_w is the degree of node n_w, and A_vw is the adjacency matrix. δ(C_v, C_w) = 1 if n_v and n_w are in the same module and 0 otherwise.

2.2  Modules in Unipartite and Bipartite Networks

Modelling real-world complex networks, wherein data is interweaved in the form of nodes and links, is one of the main research goals of big data analytics (Chang, 2018; Hu and Zhang, 2017). Graph theory, one of the most cited theories in business and information systems, offers conceptual guidance to model and analyze the interactions in complex networks (Houy and Jouneau-Sion, 2016). A network consisting of a set of nodes and a set of links that join pairs of nodes is said to be unipartite or a one-mode network. On the other hand, when a network consists of two different sets of nodes and a set of links where each link joins nodes in different sets, the network is referred to as bipartite or a two-mode network (Gupta and Kumar, 2016; Huang and Gao, 2014). Researchers have made persistent efforts to investigate and infer modular patterns in complex networks. For example, two banks may belong to the same group if they belong to the same (private or public) sector. In bipartite networks, the intuition behind modules can be developed by considering one set of nodes as the banks in a banking system and the other set of nodes as firms, where a bank–firm link exists if the firm has borrowed from a bank. Two banks are similar in terms of their credit relationships if they have provided loans to the same firms (Gupta and Kumar, 2021). Similarly, in an event–participant bipartite network, individuals who participate in similar events are more likely to be associated with each other (Davis et al., 2009). Thus, common neighborhoods on one side of the bipartite network reflect nodes belonging to the same cohesive subgroup on the other side, and vice versa. Identification of modules in bipartite networks has been used for various applications such as mapping ontologies (Fonseca, 2003) and analyzing users and content in social media (Grujic et al., 2009). Moreover, the investigation of cohesive subgroups in complex networks has multifarious applications such as the modeling of contagion (Agarwal et al., 2012), marketing and product development (Landherr et al., 2010).
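The sketch below illustrates equation (14.9) and the module idea in code: it detects modules in a small synthetic graph with NetworkX's greedy modularity heuristic and scores the resulting partition with Q. The graph is generated with a planted-partition model chosen purely for illustration.

```python
import networkx as nx
from networkx.algorithms import community

# Synthetic graph with planted modules: dense blocks, sparse links between them.
G = nx.planted_partition_graph(l=3, k=20, p_in=0.4, p_out=0.02, seed=1)

# Heuristic module discovery that greedily maximizes modularity.
modules = community.greedy_modularity_communities(G)

# Equation (14.9): modularity Q of the detected partition.
Q = community.modularity(G, modules)

print("Detected", len(modules), "modules with sizes", [len(m) for m in modules])
print(f"Modularity Q = {Q:.3f}")
```

For a two-mode network such as the bank–firm credit network discussed above, one common (though simplifying) approach is to first project the network onto one node set, for example with networkx.algorithms.bipartite.weighted_projected_graph, before applying such module-discovery methods; dedicated bipartite clustering algorithms also exist.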
3. METHODOLOGY

3.1  Background to DSR
DSR is rooted in Herbert Simon’s seminal work The Sciences of the Artificial (Simon, 1969). Interest in DSR has been growing across disciplines, notably engineering, computer science, and information systems (Baskerville, 2008). DSR is a ‘paradigm’ (Iivari, 2007) grounded in ‘discovery-through-design’ (Baskerville, 2008). DSR is “a lens or set of synthetic and analytical techniques and perspectives (complementing positivist, interpretive, and critical perspective) for performing research” (Vaishnavi and Kuechler, 2004, p. 1). DSR is about understanding and improving the search among potential components to construct an ‘artifact’ that is intended to solve a ‘real world’ problem (Baskerville, 2008). In this context, an artifact is broadly defined as constructs (e.g., the conceptual vocabulary and symbols of a domain), models (e.g., propositions or statements expressing relationships
between constructs), methods (e.g., algorithms or a set of steps used to perform a task: how-to knowledge), and instantiations (e.g., the operationalisation of constructs, models, and methods) (Vaishnavi and Kuechler, 2004; March and Smith, 1995; Hevner et al., 2004). In contrast to design practice (routine design), a 'knowledge using activity' (e.g., the application of existing knowledge to organisational problems), DSR is a 'knowledge producing activity' that addresses important unsolved problems in unique or innovative ways, or solved problems in more effective ways (March and Smith, 1995; Hevner et al., 2004). Although the iterations between design (development) and evaluate (experiment) are a significant difference between DSR and the theory-driven 'behavioural science' (Kuechler and Vaishnavi, 2008), both approaches share a common environment (e.g., people, organisations, and technology) (Silver et al., 1995). A paradigm difference between design science and behavioral science is that the former is a 'problem solving paradigm' while the latter is a 'problem understanding paradigm' (Niehaves and Stahl, 2006). As mentioned previously, we adopt the DSR model proposed by Peffers et al. (2007), which is explained in the next section.
3.2  Process Model Adopted in this DSR Project
We adopt the six-step process model (see Table 14.1) proposed by Peffers et al. (2007) as it is the most widely cited model (Vom Brocke et al., 2020), and although it is presented in a nominally sequential order, it is iterative in practice. In addition, there are four possible entry points for research, namely: (1) problem-centered approach (i.e., if the research idea resulted from observation of the problem or from suggested future research in a paper from a prior project); (2) objective-centered approach (i.e., by-product of consulting experiences whereby client expectations were not met); (3) design and development-centered approach (i.e., existence of an artifact that has not yet been formally thought through as a solution for the explicit problem domain in which it will be used); and (4) observing a solution (i.e., observing a practical solution that worked and the researchers working backwards to apply rigor to the process retroactively). The entry point for this research is the design and development-centered approach.
Table 14.1  A six-step process for design science research

Step and description (Peffers et al., 2007):
1. Problem identification and motivation: define the specific research problem and justify the value of a solution. Justifying the value of a solution is important as it (1) motivates the researcher and the audience of the research to pursue the solution and to accept the results and (2) helps to understand the reasoning associated with the researcher's understanding of the problem.
2. Define the objectives for a solution: infer the objectives of a solution from the problem definition and knowledge of what is feasible. The objectives can be quantitative (e.g., terms in which a desirable solution would be better than existing ones) or qualitative (e.g., a description of how a new artifact is expected to support solutions to problems not hitherto addressed).
3. Design and development: create the actual artifact by determining its functionality and architecture. In DSR, an artifact can include constructs, models, methods, or instantiations.
4. Demonstration: demonstrate the utility of the artifact to solve the problem. This could involve its use in experimentation, simulation, a case study, proof, or other appropriate activity.
5. Evaluation: observe and measure how well the artifact supports a solution to the problem. At the end of this activity the researchers can decide whether to iterate back to step 3 to try to improve the effectiveness of the artifact or to continue to communication and leave further improvement to subsequent projects. The nature of the research venue may dictate whether such iteration is feasible or not.
6. Communication: communicate the problem and its importance, the artifact, its utility and novelty, the rigor of its design, and its effectiveness to researchers and other relevant audiences (e.g., practitioners).
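For a novice reader it can help to see how these six activities might be organized as a single reproducible script for a social network analysis project. The skeleton below is only one possible arrangement under stated assumptions, not the authors' implementation: the file name credit_edges.csv, its column names and all helper functions are hypothetical placeholders.

```python
import networkx as nx
import pandas as pd
from networkx.algorithms import community

def identify_problem() -> str:
    """Step 1: state the problem the analysis addresses."""
    return "Discover modules of banks that share credit exposure to the same firms."

def define_objectives() -> list:
    """Step 2: objectives that the solution should meet."""
    return ["build the bank-firm bipartite network",
            "detect modules of banks",
            "report the modularity of the partition"]

def design_and_develop(edge_file: str) -> nx.Graph:
    """Step 3: construct the artifact (here, the credit network itself)."""
    edges = pd.read_csv(edge_file)  # assumed columns: company, bank
    return nx.from_pandas_edgelist(edges, "company", "bank")

def demonstrate(G: nx.Graph):
    """Step 4: apply the artifact to the collected data (module discovery)."""
    return community.greedy_modularity_communities(G)

def evaluate(G: nx.Graph, modules) -> float:
    """Step 5: measure how well the artifact works, using modularity Q."""
    return community.modularity(G, modules)

def communicate(modules, q: float) -> None:
    """Step 6: report the result to the intended audience."""
    print(f"{len(modules)} modules found, modularity Q = {q:.3f}")

if __name__ == "__main__":
    print(identify_problem())
    print(define_objectives())
    G = design_and_develop("credit_edges.csv")
    modules = demonstrate(G)
    communicate(modules, evaluate(G, modules))
```

Iterating between steps 3 and 5, as Peffers et al. (2007) describe, then simply means rerunning the script with a revised design or evaluation function.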
In the context of this study, we describe how each step as per the Peffers et al. (2004) model aligns the theoretical elements of social network analysis. Problem identification and motivation: we address the problem of how to discover patterns of interaction in a social network based on big data. According to Polites and Watson (2009), common objectives of a social network analysis include: ● Information flow analysis – to determine the direction and strength of information flows through the network, such as information that is passed from one actor to other actors within the network. ● Evaluation of actor prominence – determines the most influential actors within a network. ● Hierarchical clustering – used to identify cliques whose members are fully or almost fully connected, such as groups of actors that communicate highly with each other. ● Block modeling – aims at discovering the key links between different subgroups in the network such as actors that serve as information brokers across groups or subgroups. ● Calculation of structural equivalence measures – aims at discovering network members with similar characteristics, such as actors that correlate, thus can be considered alternatives for each other. Design and development: the design of a solution involves creating an artifact. According to March and Smith (1995), constructs, models, methods and instantiations are considered as the main artifactual types. Constructs refers to the ‘language’ developed to capture the problem and its conceptual solution. Models use this language to represent problems and solutions. Methods describe processes which provide guidance on how to solve problems. Instantiations are problem-specific aggregates of constructs, models and methods. At this stage, an artifact’s desired functionality and its architecture is determined as a prelude to the creation of the artifact. The resources required for moving from the objectives of a solution to the design of a solution include the knowledge of theory that links the objectives to the solution (Peffers et al., 2006). In social network analysis, the design of a solution is governed by social network theory. The key network concepts that organize research on network effects are centrality, cohesion, and structural equivalence (Liu et al., 2017). Demonstration: during the demonstration phase, the effectiveness of the artifact to solve one or more instances of the problem is illustrated through experimentation, simulation, proof of concept or through other accepted means. The illustration of the efficacy of a solution can also be in the form of a case study using a prototype (Fisher, 2007; Geerts and Wang, 2007). The knowledge base required at the demonstration stage is that of how to use the artifact to solve the identified instance of the problem (Fisher, 2007). Evaluation: the evaluation phase measures how well the artifact supports a solution to the problem. It involves comparing the observed results from the use of the artifact during demon-
stration to the objectives of the solution. During the evaluation phase, the utility, quality and efficacy of a design artifact should be demonstrated by executing evaluation metrics (Arnott and Pervan, 2012). The metrics are useful in establishing the performance of the new artifact. Metrics for evaluation may be based on the artifact's functionality, quantitative performance measures, satisfaction surveys, clients' feedback, and simulations. In social network analysis, evaluation metrics include contingency heuristics. Communication: this phase entails communicating the problem and its importance, the artifact, its utility, novelty, and the rigor of its design. The communication is aimed at showing the effectiveness of the artifact to researchers as well as technology-oriented and management-oriented audiences. Effective communication of design artifacts requires the knowledge of the disciplinary culture. In this phase, we underscore how the artifact developed in this study can be applied by novice users (e.g., students, practitioners) to identify cohesiveness amongst actors in a social network, for example, credit relationships amongst banks and firms.
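As a concrete illustration of the demonstration and evaluation activities, the sketch below computes two of the social network analysis objectives listed earlier, evaluation of actor prominence (centrality) and clique detection, with NetworkX. The network used is the library's built-in karate club example, chosen purely for illustration; it is not the case-study network.

```python
import networkx as nx

# Toy network (NetworkX's built-in karate club graph), for illustration only.
G = nx.karate_club_graph()

# Evaluation of actor prominence: three common centrality measures.
degree_c = nx.degree_centrality(G)
betweenness_c = nx.betweenness_centrality(G)
eigenvector_c = nx.eigenvector_centrality(G)

print("Most prominent actor by degree:", max(degree_c, key=degree_c.get))
print("Most prominent actor by betweenness:", max(betweenness_c, key=betweenness_c.get))
print("Most prominent actor by eigenvector:", max(eigenvector_c, key=eigenvector_c.get))

# Hierarchical clustering objective: cliques of fully connected actors.
largest_cliques = sorted(nx.find_cliques(G), key=len, reverse=True)[:3]
print("Three largest cliques:", largest_cliques)
```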
4. CASE STUDY: CREDIT NETWORKS AMONGST INDIAN COMPANIES AND BANKS
In financial systems, interactions arising due to credit relationships could potentially lead to systemic risk (Gupta and Kumar, 2021). These systems can be modeled as bipartite networks consisting of two heterogeneous interacting agents (nodes) connected by credit relationships (links), as in a bank–firm credit network. The analysis of real credit networks reveals that these networks have the characteristics of complex social networks, that is, high clustering coefficient, power-law degree distribution, and modular structure (De Masi et al., 2011). Once a bankruptcy occurs at a particular node in the network, it may be promulgated more widely in the network, leading to systemic consequences (Gupta and Kumar, 2021). For instance, several business houses in the hospitality, aviation, and energy sector and fitness centers such as Virgin Atlantic, Gold’s Gym, Avianca, CMX Cinemas and Apex Parks went bankrupt during the COVID-19 pandemic following a forecast of a 35 percent increase in global insolvency index by Euler Hermes (a credit insurance company) during the June 2020– June 2022 period.1 Similarly, the 2008 global financial crisis leading to the insolvency of investment banks such as Lehmann Brothers, Merrill Lynch and Bear Stearns exposed the entwined nature of financial systems. Another insolvency proceeding was initiated in 2020 owing to Reliance Capital’s default of INR 1417 crore to Yes Bank, a private sector bank in India. This resulted in several other Indian banks such as State Bank of India, HDFC bank, ICICI bank, Axis bank, and Kotak Mahindra bank investing several crores in the bank while acquiring stakes in the bank. The effect of Yes Bank’s collapse had a contagion effect across the country, with stock market indices falling sharply and a growth in credit rates. Several similar bankruptcies have occurred in the past in different parts of the world, including impairment of Japan’s banking system in 1992 and the Greek bank in 2010, and the effect of these bankruptcies has imbued throughout the respective countries or even to other countries. Due to these disastrous incidents, policymakers and governments have made an immense effort to unravel the hidden risks 1 See https://www.firstpost.com/india/insolvency-cases-have-gone-up-substantially-in-covid-hit -corporate-world-but-india-can-heave-a-sigh-of-relief-10176331.html.
These credit relationships between banks and firms help banks to earn interest margins and firms to fuel their business growth. Multiple borrowing relationships hedge companies against liquidation risk; conversely, multiple lending relationships insure banks against a firm's risk of failure. However, the propensity to form multiple or single relationships varies with internal and external conditions (De Masi et al., 2011). On the flip side, insolvencies, whether of banks or of firms, reduce lenders' appetite for risk and lead to an increase in lending rates. As insolvencies can have a contagion effect, the detection of modules provides an effective decision support mechanism for credit risk assessment in a financial system. This case study uses data from the annual reports of 20 heavily indebted companies in India to map the bipartite credit network of these companies and their bankers. Subsequently, modules of banks are identified using the concepts discussed in the review of social network analysis theory.
4.1 Data Collection
Data on the credit relationships between companies and banks in India are not available in an organized form from a single source. Therefore, we collected the data manually from the annual reports of Indian companies for the financial year 2020–21. We first shortlisted highly indebted non-financial companies from moneycontrol.com, a financial portal in India that has operated for more than two decades. This shortlisting process resulted in 20 companies. The industry type, total debt, and number of bankers for each company are shown in Table 14.2.
Table 14.2 Details of companies used in creation of credit network

#  | Company            | Industry Type                     | Debt        | No. of Bankers
1  | Reliance           | Conglomerate                      | 1,97,403.00 | 17
2  | NTPC               | Power – Generation & Distribution | 1,63,799.35 | 17
3  | Power Grid Corp    | Power – Generation & Distribution | 1,45,415.99 | 8
4  | ONGC               | Oil Drilling and Exploration      | 77,065.12   | 1
5  | IOC                | Refineries                        | 72,740.20   | 2
6  | JSW Steel          | Steel – Large                     | 49,215.00   | 9
7  | Indiabulls Housing | Housing Finance                   | 34,136.17   | 28
8  | HPCL               | Refineries                        | 33,003.40   | 9
9  | Adani Ports        | Infrastructure – General          | 31,570.43   | 18
10 | BPCL               | Refineries                        | 31,314.82   | 9
11 | NHPC               | Power – Generation & Distribution | 28,947.90   | 22
12 | SAIL               | Steel – Large                     | 27,176.05   | 21
13 | Jindal Steel       | Steel                             | 24,099.53   | 23
14 | Alok Industries    | Textiles                          | 22,770.20   | 3
15 | CESC               | Power – Generation & Distribution | 11,332.38   | 22
16 | Future Retail      | Retail                            | 5,360.11    | 14
17 | IRB Infra          | Infrastructure – General          | 5,213.51    | 19
18 | NLC India          | Power – Generation & Distribution | 13,365.62   | 5
19 | Oil India          | Oil Drilling and Exploration      | 15,398.83   | 4
20 | Can Fin Homes      | Housing Finance                   | 5,552.62    | 1
As companies have borrowed from multiple banks, there are 56 banks in the dataset. We create a bipartite network such that each bank is linked to the companies to which it loaned money. This resulted in 20 company nodes on one side of the bipartite network connected to 56 bank nodes on the other side; the network is summarized in Table 14.3.

Table 14.3 Description of credit network

Characteristics             | Value
Number of companies         | 20
Number of banks             | 56
Number of links             | 252
Average degree of companies | 12.6
Average degree of banks     | 4.5
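As an illustration of how such a network can be assembled in practice, the sketch below builds a small bipartite graph with NetworkX and reports the same kind of summary statistics as Table 14.3. The edge list shown is a hypothetical excerpt, not the full 20-company, 56-bank dataset, and the variable names are our own rather than taken from the chapter.

```python
import networkx as nx

# Hypothetical (company, bank) credit links extracted from annual reports.
credit_links = [
    ("Reliance", "SBI"), ("Reliance", "HDFC Bank"),
    ("NTPC", "SBI"), ("JSW Steel", "ICICI Bank"),
    # ... remaining company-bank pairs from the annual reports
]

B = nx.Graph()
companies = {c for c, _ in credit_links}
banks = {b for _, b in credit_links}
B.add_nodes_from(companies, bipartite=0)   # company side
B.add_nodes_from(banks, bipartite=1)       # bank side
B.add_edges_from(credit_links)

# Summary statistics of the kind reported in Table 14.3
print("companies:", len(companies), "banks:", len(banks), "links:", B.number_of_edges())
print("avg degree (companies):", sum(dict(B.degree(companies)).values()) / len(companies))
print("avg degree (banks):", sum(dict(B.degree(banks)).values()) / len(banks))
```

Tagging each node with a bipartite attribute keeps the two sides of the network separate, which the projection and clustering steps discussed below rely on.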
4.2 Analysis and Results
We applied the bipartite clustering approach to cluster the set of bank nodes in the bipartite credit network. The experimental results reveal six modules, consisting of 4, 9, 13, 6, 14 and 10 banks, as shown in Table 14.4. It is interesting to observe that the banks of non-Indian origin fall into module 1 and module 6, with the exception of Karnataka Bank and IDFC First Bank.

Table 14.4 Modules of banks in credit network
Module 1
Module 2
Module 3
Module 4
Module 5
Module 6
ANZ Bank
Bank of India
Canara Bank
Axis Bank
Bank of Baroda
Bank of America
Catholic Syrian
Union Bank of
Axis Finance Bank SBI Bank
Punjab National Bank
Barclays Bank
Bank
India
Karnataka Bank
Standard Chartered Aditya Birla
HDFC Bank
India Overseas Bank
DZ Bank
IndusInd Bank
Central Bank of India
Bank Shinhan Bank
ICICI Bank
IDFC Bank
Germany Export-import bank
BNP Paribas
Jammu & Kashmir Federal Bank
IDBI Bank
Bank
CitiBank
Bank of
Hamburg Commercial Bank
PFCL Bank
Indian Bank
IDFC First
Maharashtra
Deutsche
EXIM
DBS Bank
Mizuho Bank
Credit Agricole
Punjab & Sind
Cooperative Bank
MUFG Bank
UCO Bank
JP Morgan Sumitomo
Bank
Hong Kong and
Kotak Mahindra
Shanghai
Bank
AU Small Finance
IIFCL
RBL Bank
Yes Bank
United Overseas
IFCI
Mitsui Bank
Bank
SBI Life Insurance
Union Bank
India Infra
Most of the private sector banks are assigned to module 2, while modules 3, 4 and 5 are mainly composed of public sector banks. This modular organization clearly indicates the extent of interdependency among Indian banks arising from lending to companies. Modules 3 and 5, with 13 and 14 banks respectively, are the most critical modules for the Indian banking system: if a firm defaults to any of the banks in these clusters, adverse effects can spread to the 13 banks of module 3 or the 14 banks of module 5.
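For readers who want to reproduce this kind of module discovery, the sketch below projects the bipartite graph onto its bank side and partitions the projection with a standard modularity-maximizing heuristic from NetworkX. This is an illustrative substitute, not the constrained agglomerative clustering algorithm of Gupta and Kumar (2021) actually used in the chapter, so the resulting modules may differ; the toy edge list and variable names are assumptions.

```python
import networkx as nx
from networkx.algorithms import bipartite, community

# Miniature bipartite company-bank graph (hypothetical links), rebuilt here so
# the example runs on its own; in practice reuse the graph built earlier.
credit_links = [
    ("Reliance", "SBI"), ("Reliance", "HDFC Bank"), ("NTPC", "SBI"),
    ("JSW Steel", "ICICI Bank"), ("JSW Steel", "HDFC Bank"), ("SAIL", "ICICI Bank"),
]
banks = {b for _, b in credit_links}
B = nx.Graph(credit_links)

# One-mode projection onto the bank side: two banks are linked if they lend to
# the same company; edge weights count the number of shared borrowers.
P = bipartite.weighted_projected_graph(B, banks)

# A standard modularity-maximising heuristic, used as an illustrative stand-in
# for the chapter's constrained agglomerative clustering.
modules = community.greedy_modularity_communities(P, weight="weight")
for i, module in enumerate(modules, start=1):
    print(f"Module {i}: {sorted(module)}")

# Modularity plays the role of the contingency heuristic in the process framework.
print("modularity:", community.modularity(P, modules, weight="weight"))
```

A higher modularity value indicates a more clearly separated module structure, which is the quality signal the process framework uses when judging a partition.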
5. DISCUSSION, IMPLICATIONS AND LIMITATIONS
From the outset, the aim of this DSR project was 'to design a process framework for the effective application of social network analysis to improve the outputs of big data research projects'. Drawing on contemporary literature, we frame the contributions of this study. A design science contribution must be 'interesting' to the research community (Gregor and Hevner, 2013), as well as valued and accepted by that community through publication (Vaishnavi and Kuechler, 2004). By adopting a DSR approach, we ensure that this study produces an interesting framework that will be of value to, and accepted by, both researchers and practitioners. Gregor and Hevner (2013, p. 345) propose a knowledge contribution framework for DSR which consists of four quadrants, namely: Improvement (develop new solutions for known problems); Invention (invent new solutions for new problems); Routine Design (apply known solutions to known problems), which would rarely be accepted as a research contribution (Vaishnavi and Kuechler, 2004); and Exaptation (extend known solutions to new problems, e.g., adopt solutions from other fields). For improvement, invention and exaptation to be considered a significant research contribution, 'it must be judged as significant with respect to the current state of the knowledge in the research area and be considered interesting' (Vaishnavi and Kuechler, 2004, p. 17). We believe the contribution of the proposed framework falls under the 'invention' quadrant, as it provides a new solution to a new problem.
The proposed artifact also presents a theoretical contribution, as its construction is a special case of predictive theory that provides a prescription which, when acted upon, causes an artifact of a certain kind to come into being (Gregor, 2006). The proposed process framework is an artifact aimed at actualizing the activities involved in social network analysis. Further, in an artifactual contribution, originality and novelty refer to the introduction of a particular artifact (Ågerfalk and Karlsson, 2021). Validating the process framework using social network data of banks and firms within the Indian context provides evidence of the 'satisficeability' of the artifactual contribution (Simon, 1969). Since the entry point of this research was at the 'design and development' phase, the rich description of the design process leading to the creation of the artifact, and its instantiation using a case study of the credit networks of banks and firms in India, provide an empirical contribution. An empirical contribution captures data, measurements, observations or descriptions regarding the artifact (Ågerfalk and Karlsson, 2021).
We acknowledge two limitations of this study, which also offer directions for future research. First, the study is based on a single case of the module discovery problem within the umbrella of social network analysis, which by nature limits generalizability (Yin, 2009). Second, the proposed framework has been developed in the context of the credit networks of banks and firms in India, which is a highly regulated industry. Future research could test the applicability of the framework in different contexts, such as non-regulated environments (e.g., tourism, education).
The proposed process framework (see Figure 14.2) for conducting social network analysis using big data provides a clear formal structure in the form of predefined activities, such that the underlying processes are mapped to the application domain and knowledge outputs. In accordance with the principles of the DSR process model (Dresch et al., 2015; Peffers et al., 2006, 2018; Gupta and Tiwari, 2021), the proposed framework first formally introduces the module discovery problem to impart familiarity with its importance and relevance through real-life examples. In the second step, an understanding of the various aspects of the problem, such as credit risk, insolvency and modularity in networks, is provided. The third step explains the origin of, and scientific advances related to, credit risk assessment and the solution of the module discovery problem. The fourth step puts forward multiple classes of problems associated with module discovery, such as module discovery in unweighted and unipartite networks (Kumar et al., 2017; Gupta and Deodhar, 2021), in weighted and unipartite networks (Gupta and Tiwari, 2021), and in bipartite networks (Gupta and Kumar, 2021). The fifth step brings forth the concepts of binarization, transformation, and one-mode projection for module discovery. The sixth step demonstrates the chosen approach through its implementation and the heuristics, such as similarity, involved therein. Once the working of the solution has been explained, the seventh step discusses the concept of the contingency heuristic (modularity). Subsequently, the advantages of bipartite module discovery for credit risk assessment should be brought forward. Finally, how similar approaches of bipartite modeling and module discovery could be used in other contexts and application domains should be highlighted.
Figure 14.2 Process framework for conducting social network analysis using big data
6. CONCLUSION

This chapter provided a brief overview of the value of big data analytics and of social network analysis, which requires mathematical and computational techniques that can be used to understand the dynamics of complex real-world and artificial networks. The chapter also highlighted the importance of understanding the processes required to leverage big data to create business value through data-driven decisions. The chapter then described a process framework to guide novice users in applying social network analysis effectively and improving the outputs of big data research projects. This chapter therefore provides some interesting insights and opportunities for the research and practice of social network analysis.
REFERENCES Agarwal, S., Benmelech, E., Bergman, N. and Seru, A. (2012). Did the Community Reinvestment Act (CRA) lead to risky lending? (No. w18609). National Bureau of Economic Research. Ågerfalk, P.J. and Karlsson, F. (2021). Theoretical, empirical, and artefactual contributions in information systems research: Implications implied. In N.R. Hassan and L.P. Willcocks (eds), Advancing Information Systems Theories: Rationale and Processes (pp. 53–73). Cham: Springer International Publishing. Ahmed, W., Vidal-Alaball, J., Downing, J. and Seguí, F.L. (2020). COVID-19 and the 5G conspiracy theory: Social network analysis of Twitter data. Journal of Medical Internet Research, 22(5), e19458. Albert, R. and Barabási, A.-L. (2002). Statistical mechanics of complex networks. Reviews of Modern Physics, 74(1), 47. Albery, G.F., Kirkpatrick, L., Firth, J.A. and Bansal, S. (2021). Unifying spatial and social network analysis in disease ecology. Journal of Animal Ecology, 90(1), 45–61. Arnott, D. and Pervan, G. (2012). Design science in decision support systems research: An assessment using the Hevner, March, Park, and Ram guidelines. Journal of the Association for Information Systems, 13(11). Accessed at https://doi.org/10.17705/1jais.00315. Avital, M. and Te’Eni, D. (2009). From generative fit to generative capacity: Exploring an emerging dimension of information systems design and task performance. Information Systems Journal, 19(4), 345–67. Barabási, A.L. (2014). Network Science. Cambridge: Cambridge University Press. Barabási, A.-L. and Albert, R. (1999). Emergence of scaling in random networks. Science, 286(5439), 509–12. Bargigli, L., Di Iasio, G., Infante, L., Lillo, F. and Pierobon, F. (2015). The multiplex structure of interbank networks. Quantitative Finance, 15(4), 673–91. Baskerville, R. (2008). What design science is not. European Journal of Information Systems, 17(5), 441–43. Accessed at https://doi.org/10.1057/ejis.2008.45. Bilecen, B., Gamper, M. and Lubbers, M.J. (2018). The missing link: Social network analysis in migration and transnationalism. Social Networks, 53, 1–3. Boccaletti, S., Latora, V., Moreno, Y., Chavez, M. and Hwang, D. (2006). Complex networks: Structure and dynamics. Physics Reports, 424(4–5), 175–308. Borgatti, S.P. and Foster, P.C. (2003). The network paradigm in organizational research: A review and typology. Journal of Management, 29(6), 991–1013. Brandes, U., Robins, G., McCranie, A. and Wasserman, S. (2013). What is network science? Network Science, 1(1), 1–15. Brooks Jr, F.P. (2010). The Design of Design: Essays from a Computer Scientist. Pearson Education. Chang, V. (2018). A proposed social network analysis platform for big data analytics. Technological Forecasting and Social Change, 130, 57–68. Choi, T.M., Wallace, S.W. and Wang, Y. (2018). Big data analytics in operations management. Production and Operations Management, 27(10), 1868–83.
A process framework for big data research 229 Clauset, A., Newman, M.E. and Moore, C. (2004). Finding community structure in very large networks. Physical Review E, 70(6), 066111. Costa, L.D.F., Rodrigues, F.A., Travieso, G. and Villas Boas, P.R. (2007). Characterization of complex networks: A survey of measurements. Advances in Physics, 56(1), 167–242. Cuba, W., Rodriguez-Martinez, A., Chavez, D.A., Caccioli, F. and Martinez-Jaramillo, S. (2021). A network characterization of the interbank exposures in Peru. Latin American Journal of Central Banking, 2(3), 100035. Davenport, T.H. (2018). From analytics to artificial intelligence. Journal of Business Analytics, 1(2), 73–80. Accessed at https://doi.org/10.1080/2573234X.2018.1543535. Davis, A., Gardner, B.B. and Gardner, M.R. (2009). Deep South: A Social Anthropological Study of Caste and Class. Chicago, IL: University of South Carolina Press. De Masi, G., Fujiwara, Y., Gallegati, M., Greenwald, B. and Stiglitz, J.E. (2011). An analysis of the Japanese credit network. Evolutionary and Institutional Economics Review, 7, 209–32. Accessed at https://doi.org/10.14441/eier.7.209. Dennehy, D., Oredo, J., Spanaki, K., Despoudi, S. and Fitzgibbon, M. (2021). Supply chain resilience in mindful humanitarian aid organizations: The role of big data analytics. International Journal of Operations & Production Management, 41(9), 1417–41. Accessed at https://doi.org/10.1108/IJOPM -12-2020-0871. Dorogovtsev, S.N. and Mendes, J.F. (2002). Evolution of networks. Advances in Physics, 51(4), 1079–187. Dresch, A., Lacerda, D.P. and Antunes, J.A.V. (2015). Design science research. In Design Science Research (pp. 67–102). Cham: Springer. Fisher, I.E. (2007). A prototype system for temporal reconstruction of financial accounting standards. International Journal of Accounting Information Systems, 8(3), 139–64. Accessed at https://doi.org/ 10.1016/j.accinf.2007.07.001. Fonseca, Y.C.F. (2003). A bipartite graph co-clustering approach to ontology mapping. In Proceedings of the Workshop on Semantic Web Technologies for Searching and Retrieving Scientific Data. Colocated with the Second International Semantic Web Conference (ISWC-03). Fortunato, S. (2010). Community detection in graphs. Physics Reports, 486(3), 75–174. Fosso Wamba, S., Akter, S., Edwards, A., Chopin, G. and Gnanzou, D. (2015). How ‘big data’ can make big impact: Findings from a systematic review and a longitudinal case study. International Journal of Production Economics, 165, 234–46. Accessed at https://doi.org/10.1016/j.ijpe.2014.12.031. Fosso Wamba, S., Gunasekaran, A., Akter, S., Ren, S.J., Dubey, R. and Childe, S.J. (2017). Big data analytics and firm performance: Effects of dynamic capabilities. Journal of Business Research, 70, 356–65. Gandomi, A. and Haider, M. (2015). Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information Management, 35(2), 137–44. Accessed at https://doi.org/10 .1016/j.ijinfomgt.2014.10.007. Garton, L., Haythornthwaite, C. and Wellman, B. (1999). Studying on-line social networks. In Doing Internet Research: Critical Issues and Methods for Examining the Net (pp. 75–106). Thousand Oaks, CA: SAGE Publications. Accessed at https://doi.org/10.4135/9781452231471. Geerts, G.L. and Wang, H.J. (2007). The timeless way of building REA enterprise systems. Journal of Emerging Technologies in Accounting, 4(1), 161–82. Accessed at https://doi.org/10.2308/jeta.2007 .4.1.161. Ghasemaghaei, M., Ebrahimi, S. and Hassanein, K. (2018). 
Data analytics competency for improving firm decision making performance. The Journal of Strategic Information Systems, 27(1), 101–13. Gregor, S. (2006). The nature of theory in information systems. Management Information Systems Quarterly, 30(3). Accessed at https://aisel.aisnet.org/misq/vol3. Gregor, S. and Hevner, A.R. (2013). Positioning and presenting design science research for maximum impact. MIS Quarterly, 37(2), 337–55. Accessed at https://www.jstor.org/stable/43825912. Grover, P. and Kar, A.K. (2017). Big data analytics: A review on theoretical contributions and tools used in literature. Global Journal of Flexible Systems Management, 18(3), 203–29. Grover, V., Chiang, R.H., Liang, T.P. and Zhang, D. (2018). Creating strategic business value from big data analytics: A research framework. Journal of Management Information Systems, 35(2), 388–423.
230 Handbook of big data research methods Grujic, J., Mitrovic, M. and Tadic, B. (2009). Mixing patterns and communities on bipartite graphs on web-based social interactions. In 16th International Conference on Digital Signal Processing (pp. 1–8). IEEE. Gupta, S. and Deodhar, S. (2021). Understanding digitally enabled complex networks: A plural granulation based hybrid community detection approach. Information Technology & People. Gupta, S. and Kumar, P. (2016). Community detection in heterogenous networks using incremental seed expansion. In 2016 International Conference on Data Science and Engineering (ICDSE) (pp. 1–5). IEEE. Gupta, S. and Kumar, P. (2021). A constrained agglomerative clustering approach for unipartite and bipartite networks with application to credit networks. Information Sciences, 557, 332–54. Gupta, S. and Tiwari, A.A. (2021). A design-based pedagogical framework for developing computational thinking skills. Journal of Decision Systems, 1–18. Hevner, A.R. (2007). A three cycle view of design science research. Scandinavian Journal of Information Systems, 19(2), 4. Hevner, A., March, S., Park, J. and Ram, S. (2004). Design science in information systems research. Management Information Systems Quarterly, 28(1). Accessed at https://aisel.aisnet.org/misq/vol28/ iss1/6. Hirschheim, R. (2021). The attack on understanding: How big data and theory have led us astray: A comment on Gary Smith’s Data Mining Fool’s Gold. Journal of Information Technology, 36(2), 176–83. Holmström, J., Ketokivi, M. and Hameri, AP. (2009). Bridging practice and theory: A design science approach. Decision Sciences, 40, 65–87. Houy, N. and Jouneau-Sion, F. (2016). Defaulting firms and systemic risks in financial networks. Available at SSRN 2727693. Huang, Y. and Gao, X. (2014). Clustering on heterogeneous networks. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 4(3), 213–33. Hu, J. and Zhang, Y. (2017). Discovering the interdisciplinary nature of Big Data research through social network analysis and visualization. Scientometrics, 112(1), 91–109. Iivari, J. (2007). A paradigmatic analysis of information systems as a design science. Scandinavian Journal of Information Systems, 19(2), 5. Kim, J. and Hastak, M. (2018). Social network analysis: Characteristics of online social networks after a disaster. International Journal of Information Management, 38(1), 86–96. Kuechler, B. and Vaishnavi, V. (2008). On theory development in design science research: Anatomy of a research project. European Journal of Information Systems, 17(5), 489–504. Kumar, P., Gupta, S. and Bhasker, B. (2017). An upper approximation based community detection algorithm for complex networks. Decision Support Systems, 96, 103–18. Landherr, A., Friedl, B. and Heidemann, J. (2010). A critical review of centrality measures in social networks. Business & Information Systems Engineering, 2(6), 371–85. Laumann, E.O., Galaskiewicz, J. and Marsden, P.V. (1978). Community structure as interorganizational linkages. Annual Review of Sociology, 4, 455–84. Accessed at https://www.jstor.org/stable/2945978. Liu, W., Sidhu, A., Beacom, A. and Valente, T. (2017). Social network theory. In P. Rossler, H. Cynthia and Z. van Liesbet (eds), Social Networks and Macrosocial Change. Wiley. Accessed at https://doi .org/10.1002/9781118783764.wbieme0092. March, S.T. and Smith, G.F. (1995). Design and natural science research on information technology. Decision Support Systems, 15(4), 251–66. 
Accessed at https://doi.org/10.1016/0167-9236(94)00041 -2. McPherson, M., Smith-Lovin, L. and Cook, J.M. (2001). Birds of a feather: Homophily in social networks. Annual Review of Sociology, 27, 415–44. Mikalef, P., Boura, M., Lekakos, G. and Krogstie, J. (2019). Big data analytics and firm performance: Findings from a mixed-method approach. Journal of Business Research, 98, 261–76. Mikalef, P., Framnes, V.A., Danielsen, F., Krogstie, J. and Olsen, D. (2017). Big data analytics capability: antecedents and business value. PACIS 2017 Proceedings, 136. Modgil, S., Gupta, S., Sivarajah, U. and Bhushan, B. (2021). Big data-enabled large-scale group decision making for circular economy: An emerging market context. Technological Forecasting and Social Change, 166, 120607.
A process framework for big data research 231 Moreno, E.A., Cerri, O., Duarte, J.M., Newman, H.B., Nguyen, T.Q., Periwal, A., Pierini, M. et al. (2020). JEDI-net: A jet identification algorithm based on interaction networks. The European Physical Journal C, 80, 1–15. Newman, M.E. (2003). The structure and function of complex networks. SIAM Review, 45(2), 167–256. Newman, M.E. (2006). Modularity and community structure in networks. Proceedings of the National Academy of Sciences, 103(23), 8577–82. Niehaves, B. and Stahl, B.C. (2006). Criticality, epistemology and behaviour vs. design – information systems research across different sets of paradigms. ECIS 2006 Proceedings. 166. Accessed at https:// aisel.aisnet.org/ecis2006/166. Nunamaker, J.F., Chen, M. and Purdin, T.D. (1991). Systems development in information systems research. Journal of Management Information Systems, 7(3), 89–106. Pagani, G.A. and Aiello, M. (2013). The power grid as a complex network: A survey. Physica A: Statistical Mechanics and its Applications, 392(11), 2688–700. Pallavicini, F., Cipresso, P. and Mantovani, F. (2017). Beyond sentiment: How social network analytics can enhance opinion mining and sentiment analysis. In F.A. Pozzi, E. Fersini, E. Messina and B. Liu (eds), Sentiment Analysis in Social Networks (pp. 13–29). Cambridge, MA: Morgan Kaufmann. Peffers, K., Tuunanen, T. and Niehaves, B. (2018). Design science research genres: Introduction to the special issue on exemplars and criteria for applicable design science research. European Journal of Information Systems, 27(2), 129–39. Accessed at https://doi.org/10.1080/0960085X.2018.1458066. Peffers, K., Tuunanen, T., Rothenberger, M.A. and Chatterjee, S. (2007). A design science research methodology for information systems research. Journal of Management Information Systems, 24(3), 45–77. Peffers, K., Tuunanen, T., Gengler, C.E., Rossi, M., Hui, W., Virtanen, V. and Bragge, J. (2006). The design science research process: A model for producing and presenting information systems research. Accessed at https://jyx.jyu.fi/handle/123456789/63435. Polites, G. and Watson, R. (2009). Using social network analysis to analyze relationships among IS journals. Journal of the Association for Information Systems, 10(8). Accessed at https://doi.org/10 .17705/1jais.00206. Rittel, H.W. and Webber, M.M. (1974). Wicked problems. Man-made Futures, 26(1), 272–80. Rosvall, M. (2006). Information horizons in a complex world. Doctoral dissertation, Department of Physics, Umeå University, Umeå. Saqr, M., Fors, U., Tedre, M. and Nouri, J. (2018). How social network analysis can be used to monitor online collaborative learning and guide an informed intervention. PloS one, 13(3), e0194777. Schäfer, R., Barbaresi, A. and Bildhauer, F. (2013). The good, the bad, and the hazy: Design decisions in web corpus construction. In Proceedings of the 8th Web as Corpus Workshop, July, Lancaster, UK. (pp. 7–15). Seddon, J.J. and Currie, W.L. (2017). A model for unpacking big data analytics in high-frequency trading. Journal of Business Research, 70, 300–307. Silver, M.S., Markus, M.L. and Beath, C.M. (1995). The information technology interaction model: A foundation for the MBA core course. MIS Quarterly, 19(3), 361–90. Simon, H. (1969). The Sciences of the Artificial. Cambridge, MA: MIT Press. Simon, H.A. (1973). The structure of ill structured problems. Artificial Intelligence, 4(3–4), 181–201. Soramäki, K., Bech, M.L., Arnold, J., Glass, R.J. and Beyeler, W.E. (2007). 
The topology of interbank payment flows. Physica A: Statistical Mechanics and its Applications, 379(1), 317–33. Travers, J. and Milgram, S. (1969). An experimental study of the small world problem. Sociometry, 32(4), 425–43. Vaishnavi, V. and Kuechler, W. (2004/21). Design science research in information systems. 20 January, 2004 (updated in 2017 and 2019 by V. Vaishnavi and P. Stacey); last updated 24 November 2021. Accessed at URL: http://www.desrist.org/design-research-in-information-systems/. Valeri, M. and Baggio, R. (2021). Social network analysis: Organizational implications in tourism management. International Journal of Organizational Analysis, 29(2), 342–53. Vom Brocke, J. and Rosemann, M. (2015). Handbook on Business Process Management 1: Introduction, Methods, and Information Systems. Springer. Vom Brocke, J., Hevner, A. and Maedche, A. (2020). Introduction to Design Science Research. In Design Science Research. Cases (pp. 1–13). Cham: Springer.
232 Handbook of big data research methods Walls, J.G., Widmeyer, G.R. and El Sawy, O.A. (1992). Building an information system design theory for vigilant EIS. Information Systems Research, 3(1), 36–59. Accessed at https://doi.org/10.1287/isre .3.1.36. Wasserman, S. and Faust, K. (1994). Social Network Analysis: Methods and Applications. Cambridge University Press. Watson, H. (2014). Tutorial: Big data analytics: Concepts, technologies, and applications. Communications of the Association for Information Systems, 34(1). Accessed at https://doi.org/10.17705/1CAIS.03462. Watts, D.J. and Strogatz, S.H. (1998). Collective dynamics of ‘small-world’ networks. Nature, 393(6684), 440–42. Yin, R.K. (2009). Case Study Research: Design and Methods. Thousand Oaks, CA: Sage Publications. Zipf, G.K. (1949). Human Behavior and the Principle of Least Effort: An Introduction to Human Ecology. Ravenio Books.
15. Notre-Dame de Paris cathedral is burning: let's turn to Twitter
Serge Nyawa, Dieudonné Tchuente and Samuel Fosso Wamba
1. INTRODUCTION

Social media are now becoming key technological tools to collect and disseminate information during important events such as festivals (Stieglitz et al., 2018), disaster management (Kim and Hastak, 2018; Martinez-Rojas et al., 2018; Stieglitz et al., 2018), politics (Baxter and Marcella, 2017; Stieglitz et al., 2018), and marketing (Rathore et al., 2018). Among the interactive computer-mediated technologies that facilitate the creation and sharing of information, Twitter holds a leadership position (Martinez-Rojas et al., 2018; Stieglitz et al., 2018). Created in March 2006 by Jack Dorsey, Noah Glass, Biz Stone and Evan Williams, Twitter is an online news and social networking service. Users, known as "twitterers", can post, like, or retweet messages known as "tweets". Initially restricted to 140 characters, the length of a tweet can nowadays reach 280 characters. Globally, this social network has around 100 million daily active users, who post an average of 500 million tweets per day. Registered users mostly rely on their mobile phones to interact with the platform (80 per cent). The success of Twitter can be explained by many factors, the most important of which are: audience growth, instant communication, real-time information, and direct support for response efforts (Martinez-Rojas et al., 2018). In addition, Twitter is very convenient for socializing. Artists can use it to update their fans by posting links to YouTube videos, making it a good channel for entertainment. Companies can rely on it as a channel for news and promotions. Twitter is an easy way of finding people willing to support a specific cause. Considering these properties, Twitter can be seen as a powerful mechanism for faster information dissemination. Effective and efficient dissemination of information can be critical in situations where human lives or patrimony are at risk, such as disasters. UNISDR (the United Nations Office for Disaster Risk Reduction) defines a disaster as a serious disruption of the functioning of a community or a society involving widespread human, material, economic or environmental losses and impacts, which exceeds the ability of the affected community or society to cope using its own resources. When a natural process causes a loss of lives, injury, property damage, loss of livelihoods or services, the disaster is said to be "natural" (e.g., earthquakes, landslides, volcanic eruptions, floods, hurricanes, tornadoes, blizzards, tsunamis, or cyclones). Apart from natural disasters of an environmental, hydrological, climatological or geophysical nature, there are also biological disasters such as insect infestations, epidemics and animal attacks. Otherwise, the disaster is a "man-made" disaster (e.g., stampedes, fires, transport accidents, industrial accidents, oil spills, nuclear explosions/nuclear radiation, wars). A disaster naturally leads to an emergency situation.
The use of Twitter as a source of information during an emergency is nowadays common (Son et al., 2019; Martinez-Rojas et al., 2018). Users can post recent information or share crucial news with their followers by retweeting received information. This information is particularly important for disaster management: while governmental agencies or private organizations can rely on the generated information to optimize their interventions, communities cannot reduce disaster-related human and economic impacts, organize their evacuation procedures or disclose their urgent requests (how they can be helped) without relying on such raw material. There is a growing body of literature trying to understand how Twitter-based information is transmitted during disasters. Examples include topics on disaster-related information retrieval and propagation, behavioural modelling, disaster surveillance, social support, and mental health management (Chen et al., 2014; Gruebner et al., 2016; Lu and Yang, 2012; Imran et al., 2015). However, available data show that natural disasters (earthquakes, floods, hurricanes, tornadoes, tsunamis, cyclones, etc.) have so far attracted the attention of a greater number of researchers and scholars, and that only very few of them have investigated "man-made" disasters. During a man-made disaster, sharing updated information is crucial: loss of life and property damage can be avoided when information is shared rapidly. Thus, it is crucial to understand the factors that could facilitate the sharing of information in the event of a large-scale disaster such as a huge fire. Investigating information generation and sharing during a man-made disaster of considerable magnitude is the major objective of this chapter. More precisely, this study draws on prior studies on social media use and adoption during fire and rescue to study the communication performance of fire disaster tweets. Since some characteristics of tweets predict greater information propagation (Yang and Counts, 2010), we are interested in understanding how tweet features influence information timeliness during a fire disaster. We focus on a recent famous fire disaster, namely the fire that occurred at the Notre-Dame de Paris cathedral. Our research question is set forth as follows:

RQ: How do tweets' features influence the information timeliness during a fire disaster?
To address this question, this study draws on prior studies on social media adoption and use for disaster management, with a main focus on Twitter, and data related to the Notre-Dame de Paris fire disaster that are being collected on the Twitter platform. The rest of this chapter is structured as follows: the next section presents a literature review on disaster management with Twitter. The subsequent section presents our research hypotheses, followed by another section describing our research methodology. Another section discusses data analysis, while the next presents results and a discussion. The last section serves as a conclusion, coupled with future research directions.
2. LITERATURE REVIEW
Disasters now represent a major threat to local, regional and national governments worldwide. The large number of disasters in recent decades has made disaster management a major point of interest and research (Bagloee et al., 2019). Disasters can be caused by a natural event, an animal or a human being. They may include floods, wars, droughts, solar flares, cosmic explosions and meteorites, earthquakes, landslides, tsunamis, hurricanes, and terrorist attacks
(Teodorescu, 2015). Disasters can cause huge economic and human losses to affected areas. For example, it was estimated that 1.3 million people were killed between 1998 and 2017 by climate-related and geophysical disasters. In the same period, 4.4 billion people were injured, homeless, displaced or in need of emergency assistance because of these disasters (Wallemacq and House, 2018). The direct economic losses from these disasters were estimated at US$2908 billion during the same period (Wallemacq and House, 2018). The management of these disasters has become a strategic priority for governments at all levels worldwide. Disaster management involves the planning and execution of numerous phases such as preparedness, response, recovery and mitigation (Sushil, 2017). The success of operations across the various disaster management phases requires not only a high level of collaboration and coordination among all levels in the dedicated government bodies and private agencies, but also information sharing for improved decision-making execution (Chatfield et al., 2010; Sushil, 2017), along with adequate resource allocation and utilization (Altay and Pal, 2014). Information technology (IT) has been playing an important role in managing disaster phases. With the development and success of social networks, a lot of research has focused on the dissemination of information through online microblogging systems (Southwell, 2013). Microblogging platforms have not only become an important means of exploring news events and expressing views and insights, but also important venues for disseminating trending topics (Ma et al., 2019). Recently, social media has been recognized as a critical player in disaster management (Kim et al., 2018; Ragini et al., 2018; Stieglitz et al., 2018; Elbanna et al., 2019). Much content generated on social media contains information about social issues and events such as natural disasters (Yoo et al., 2018). Indeed, social media platforms have been appropriate venues for sharing information and other data during disaster management processes. Better still, they give the opportunity to rapidly send alerts and early warnings, identify critical needs, focus responses (Carley et al., 2016; Dang et al., 2016), and provide actionable information (Murray‐Tuite et al., 2019). With social media, information can be gained in a timely manner (Pedraza‐Martinez and Van Wassenhove, 2016), while the disaster being managed can be easily evaluated in terms of severity and risk (Carley et al., 2016; Anson et al., 2017; Wu and Cui, 2018). Disaster management officers can also rely on social media to create a community preparedness ecosystem (Anson et al., 2017; Kim and Hastak, 2018; Wu and Cui, 2018), share photos and check into locations (Murthy and Gross, 2017), spread misinformation (Murthy and Gross, 2017) and easily reach the people affected and the rescue teams in the field. Visualizations based on pictures and maps are the most common tools for emergency management (Dusse et al., 2016). In the specific area of disaster management, the literature has reported the successful use of Twitter for related aspects such as planning, warning and response (Landwehr et al., 2016). Twitter has become a leading microblogging platform for disseminating events occurring in the world on a real-time basis (Choi et al., 2019; Bagloee et al., 2019).
The same tool was used during important recent disasters such as the Japanese earthquake and tsunami, hurricane Sandy, and the Haiti earthquake, in order to collect and exchange information among key stakeholders (Yates and Paquette, 2011; Yoo et al., 2016; Murthy and Gross, 2017). Moreover, Twitter has been used to identify messages from disaster-stricken areas and assess the damage (Wu and Cui, 2018). As an early warning tool for detecting earthquake tremors (Ragini et al., 2018) and a source of information for real-time decision-making during disasters (Martinez-Rojas et al., 2018), Twitter can play even greater roles in disaster contexts. When examining Twitter use during and after typhoon Haiyan pummelled the Philippines, Takahashi et al. (2015) concluded that "different stakeholders
used social media mostly for dissemination of second-hand information, in coordinating relief efforts, and in memorializing those affected" (p. 392). In a recent systematic literature review on Twitter, Martinez-Rojas et al. (2018) classified the major disasters and emergency events managed by means of Twitter into "natural disasters", "industrial disasters" and "security disasters" (Figure 15.1). However, by comparing man-made disasters (e.g., fires, explosions, oil spills, terrorist attacks, and dam evacuations) with natural disasters (e.g., earthquakes, floods, storms, hurricanes, typhoons, tsunamis) in this literature review, we noticed the predominance of the latter category, with man-made disasters representing only approximately 12 per cent of the total number of papers considered. Table 15.1 describes the main studies related to man-made disasters only. The papers reviewed indicate that only the response and/or recovery phases are adequately studied in most of the literature. With regard to the subjects discussed in these studies, most of them are concerned with tweets' content analysis, sentiment analysis, situational awareness, twitterers' analysis, network structure analysis, patterns of retweets and information dissemination. Even if some of these studies are interested in information propagation (Abedin and Babar, 2018; Cvetojevic and Hochmair, 2018; Martinez-Rojas et al., 2019), their approach is rather holistic, without going into specificities such as the relationship between tweets' features and information dissemination. Since some characteristics of the tweets themselves predict greater information propagation (Yang and Counts, 2010), our aim is to help understand how tweets' features (e.g., number of words, number of URLs, number of hashtags, importance of hashtags, number of followers, hour of the tweet) influence information timeliness during a man-made fire disaster, while considering differences in information timeliness between the response phase and the recovery phase of such a man-made disaster. Studies such as the one by Son et al. (2019) have relied on a somewhat similar approach to investigate a natural disaster case. Man-made disasters are particular to the extent that they are very often limited to two phases, namely the response phase and the recovery phase, the former very often being faster. This can profoundly change the way information is propagated as compared to natural disasters.
Figure 15.1 Some major disaster events studied in Twitter from 2008 to 2018
Source: Martinez-Rojas et al. (2018).
Table 15.1 Main studies on Twitter related to man-made disasters

Paper | Disaster type | Phases analysed | Subjects discussed
De Longueville et al. (2009) | Forest fire | Response and recovery | Content analysis; spatio-temporal analysis; roles of twitterers; cited URLs analysis
Abedin and Babar (2018) | Fire in Australia | Response | Content analysis; institutional vs. non-institutional tweets analysis; dissemination of disaster information
Vieweg et al. (2010) | Oklahoma Grassfires and Red River Floods | Recovery | Content analysis; enhancing situational awareness
Starbird et al. (2015) | 2010 BP Deepwater Horizon Oil Spill | Response | Content analysis; twitterers' analysis; topics analysis
Sutton et al. (2013) | 2010 BP Deepwater Horizon Oil Spill | Response | Content analysis; governmental organizations' use of Twitter; network structure analysis
Nagy and Stamberger (2012) | San Bruno California gas explosion fire disaster | Response and recovery | Content analysis; sentiment detection
Wang et al. (2016) | San Diego wildfire | Response and recovery | Content analysis; retweet network analysis; spatio-temporal analysis; situational awareness
Nayebi et al. (2017) | Fort McMurray Wildfire | Recovery phase | Content analysis; features suggestion for emergency mobile apps
Kaila (2016) | Fort McMurray Wildfire | Recovery phase | Content analysis; topic modelling
Martinez-Rojas et al. (2019) | Explosion of a chemical plant at Paterna (Valencia) | Recovery phase | Content analysis; information propagation after the disaster
Wang et al. (2016); Benton et al. (2016) | Sandy Hook Elementary School shooting in Connecticut | Recovery phase | Content analysis; pro-gun and anti-gun sentiment analysis; understanding trends in gun violence
Cvetojevic and Hochmair (2018) | November 2015 Paris terrorist attacks | Recovery phase | Content analysis; analysis of the spatial patterns of retweets; prediction of counts of tweets around the world; identifying hierarchical spread; analysis of the influence of tweet content category, tweet format, and user profession on the number of retweets
Oh et al. (2011) | Mumbai terrorist attack | Recovery phase | Content analysis; vulnerabilities of Twitter as a participatory emergency reporting system in the terrorism context; conceptual framework for analysing information control in the context of terrorism
3. HYPOTHESES DEVELOPMENT
To better understand how characteristics of tweets influence the speed of information propagation during a fire disaster, it is necessary to set specific measures for the key concepts. First, we need to reliably measure "the speed of information propagation", or information timeliness. The information needed in this case is supposed to be closely relevant to the crisis (situational information, vital advice, decisive news media and multimedia links). During a calamity, many twitterers dedicate themselves exclusively to passing along existing messages. They therefore act as amplifiers of emergency information, making it more visible. With Twitter, such information is most often relayed via "retweets". Thus, retweet activity reflects how the social network helps to propagate the information. The more important the information, the more it will be retweeted, and the speed at which it is relayed can be measured by the average retweet time. As a result, a natural proxy for the speed of a tweet's information propagation is the average retweet time. A similar measure has been advocated by Son et al. (2019). Second, language-independent tweet characteristics that can impact information timeliness must be identified. Communication researchers have identified message features that drive information to diffuse widely (Berger and Milkman, 2012; Kim, 2015; Meng et al., 2018). For the specific case of the microblog Twitter, it has been established that particular elements in tweets impact attention and promote diffusion, namely: hashtags, mentions (a retweet being one particular case of mentioning), replies, as well as URLs (Orellana-Rodriguez and Keane, 2018). According to Son et al. (2019), the most important among them are: the hour of time when a tweet is posted; the total number of retweets of a tweet; the number of words in a tweet; the number of URLs in a tweet; the number of hashtags in a tweet; the number of Twitter followers that an active twitterer has; the number of Twitter friends that an active twitterer has; and the number of tweets made by a twitterer. When a message is transmitted, the reading time depends on the length of the text: the more words used, the longer the reading. In addition, when the communication language is unconventional (use of non-universal abbreviations, symbols, emoticons, etc.), reading and understanding a message can be time-consuming. On Twitter, the limit on the number of characters in a tweet, combined with the desire to communicate as quickly as possible, leads users to make frequent use of abbreviations. As a result, a tweet containing such abbreviations will take longer to read and understand, and this will delay any sharing with followers (retweet). We will therefore verify the following hypothesis:

RH1: The number of words in a tweet is positively associated with retweet time.
Given the limit on the number of characters in a tweet, any method of bypassing this restriction is welcomed by Twitter users, and it is in this perspective that external links are used. The inclusion of external URLs is an effective method for sharing information: in many cases, these links point to content whose length exceeds the maximum allowed (videos, news articles, photos, etc.). The presence of URLs is thus a factor that speeds up the sharing of tweets. Zarrella (2009) reported that 56.7 per cent of retweets embed URLs. The following hypothesis will be tested:

RH2: The number of URLs in a tweet is negatively associated with retweet time.
With the aim of specifying the theme, topic or content of a tweet, a hashtag is a concatenation of the sign "#" and letters, digits, or underscores, and is usually used as a topical keyword. Studying a sample of 74 million tweets, Suh et al. (2010) found that 10 per cent of tweets had at least one hashtag and that around 21 per cent of retweets in their dataset contained hashtags. Thus, the hashtag has a strong relationship with retweetability (Suh et al., 2010): tweets containing hashtags have a higher probability of being retweeted. The more hashtags a tweet has, the more easily identifiable are the subject it deals with and the information it contains, and rapid deciphering of the information leads to its rapid transmission or sharing. Thus, the following hypothesis will be verified:

RH3: The number of hashtags in a tweet is negatively associated with retweet time.
Beyond the number of hashtags a tweet contains, the quality and accuracy of the terms used in constructing those hashtags are parameters that significantly influence the decision to share a previously received tweet with one's followers in the form of a retweet. Two tweets containing the same number of hashtags but with different levels of influence or precision will be shared at different speeds. In our study, the quality, influence or accuracy of a hashtag is summed up as its "importance". Therefore, the following hypothesis will be tested:

RH4: The importance of the hashtags used in a tweet is negatively associated with retweet time.
Given differences in the need for information, the way of communicating cannot remain the same during the two phases of a man-made disaster. For example, during the recovery phase, twitterers may need detailed information about the disaster, whereas they prefer brief information during the response phase to facilitate rapid decision-making (Son et al., 2019). Thus, the following hypotheses about the response and the recovery phases will be tested:

RH5: Information timeliness is higher during the response phase.
RH6: Information timeliness is reduced during the recovery phase because of the number of words in a tweet.
RH7: Information timeliness is increased during the recovery phase because of the number of URLs in a tweet.
RH8: Information timeliness is attenuated during the recovery phase due to the significance of hashtags in a tweet.
4. RESEARCH METHODOLOGY

4.1 Brief Description of Notre-Dame de Paris
Often referred to as Notre-Dame, the French medieval Catholic cathedral "Notre-Dame de Paris" is one of the most emblematic monuments in Paris. For a long time, it has been one of the biggest cathedrals in Europe. The construction of this monument, fully dedicated to the Virgin Mary, started in 1160 and was completed by 1260, that is, 100 years later. Following the French Revolution, Notre-Dame was partially destroyed, but it was significantly restored in 1844,
under the supervision of the famous French architect Eugène Viollet-le-Duc. The popularity of the monument had been highlighted by Victor Hugo's novel Notre-Dame de Paris, published in 1831, which spurred the government's decision to restore the cathedral. Nowadays, with more than 12 million visitors per year, Notre-Dame is the most visited monument in Paris, and one of the most popular places visited by tourists in Europe.
4.1.1 The fire disaster of Notre-Dame
On 15 April 2019, the cathedral of Notre-Dame de Paris faced a major man-made disaster as it was devastated by fire. The fire started around 18:20 CEST and consumed the monument over the next 15 hours. By the time the fire was extinguished, the cathedral had suffered extensive damage: the spire was completely destroyed, and the upper walls were severely damaged, as were the roofs of the nave, the transept and the frame. The investigation into the origin of the fire started the same day, and the preliminary conclusion indicated that it could have been accidental. Even though no human life was lost, the reconstruction of this monument would be very costly. According to the first estimation (made by the "Fondation du Patrimoine"), the reconstruction of the cathedral could require about EUR 1 billion and at least five years of intensive work. As a monument almost 1000 years old, known to everyone in France and Europe, Notre-Dame being engulfed by fire evoked strong emotions. People from all walks of life, from those passionate about art and lovers of historical monuments to religious figures and politicians, were all searching for the most recent information. The disaster was broadcast live on all international TV channels, with distressing images and information being shared on all social media networks. Twitter featured as a veritable information hub for this sad event. The following categories regularly used it to inform, share news, enquire or simply comment:
● Heads of state:
● French President Emmanuel Macron [translation in English of the original tweet]: "Notre-Dame de Paris is in flames. A whole nation is in emotion. My thoughts go to all Catholics and all French people. Like all our compatriots, I am sad tonight to see this part of us burn." (66 292 retweets and 204 571 likes);
● US President Donald Trump: "So horrible to watch the massive fire at Notre Dame Cathedral in Paris ... Perhaps flying water tankers could be used to put it out. Must act quickly!" (37 644 retweets and 202 979 likes);
● The UK Prime Minister Theresa May: "My thoughts are with the people of France tonight and with the emergency services who are fighting the terrible blaze at Notre-Dame cathedral" (1153 retweets and 6869 likes);
● Religious people:
● Pope Francis: "Today we unite in prayer with the people of France, as we wait for the sorrow inflicted by the serious damage to be transformed into hope with reconstruction. Holy Mary, Our Lady, pray for us. #NotreDame" (15 437 retweets and 77 616 likes);
● The Archbishop of Paris, His Lordship Michel Aupetit: "To all the priests of Paris: Firefighters are still fighting to save the towers of Notre-Dame de Paris. The frame, roof and spire are consumed. Pray. If you wish, you can ring the bells of your churches to invite to prayer." (3582 retweets and 7559 likes);
● Fans of historic monuments:
● Stéphane Bern: "In the aftermath of this tragedy that ravaged @notredameparis the jewel of our heritage, the building block of our national history, we must, beyond tears and words, unite to rebuild it. @fond_patrimoine Solidarity with @dioceseparis and thank you @PompiersParis." (2185 retweets and 8144 likes).
A lot of official and non-official information was shared via Twitter. For these reasons, we will rely on tweets to understand how people share information during a fire disaster.
4.2 Data Collection
Due to the unpredictability of a fire disaster, data can be collected only during two phases of the disaster: the response phase and the recovery phase. The response phase consists of providing a quick and effective response to the disaster and limiting its negative consequences. The following interventions are generally carried out during this phase: save and protect human life; extinguish the flames; contain the spread of flames to neighbouring buildings and mitigate their impact; provide the general public with warnings, advice and information; protect the safety of firefighters; and facilitate investigations and inquiries. For the Notre-Dame fire, the response phase started at 06:20 p.m. on 15 April 2019 and stopped 15 hours later. During the recovery phase, all aspects of the disaster's impact relevant to restoration were identified; this process takes time because of the complexity of the cathedral. It started on 16 April 2019 at around 09:20 a.m. and is estimated to take at least five years. For the purpose of this study, we selected ten days of the recovery phase. Some of these days immediately follow the day of the disaster: 16 April 2019, 17 April 2019, 18 April 2019 and 19 April 2019. Other days were selected from the weeks that followed: 10 May 2019, 11 May 2019, 13 May 2019, 15 May 2019, 16 May 2019 and 19 May 2019.
Table 15.2 Keywords and hashtags used to collect tweets

Date | Keywords | Hashtags
15, 16, 17, 18, 19 April 2019 | NotreDame, Notre-Dame, NotreDameDeParis, NotreDameFire, Notre Dame, NotreDameCathedralFire, Notre Dame de Paris | #NotreDameDeParis, #NotreDameFire
10, 11, 13, 15, 16, 19 May 2019 | NotreDame, Notre-Dame, NotreDameDeParis, NotreDameFire, Notre Dame, NotreDameCathedralFire, Notre_dame_de_Paris, Reconstruction, Rebuild | #NotreDameDeParis, #NotreDameFire
Tweets were collected with Twitter’s streaming API using the keywords and hashtags shown in Table 15.2. Our dataset is made up of 95 303 tweets collected on the following days: 15, 16, 17, 18 and 19 April 2019, and 10, 11, 13, 15, 16 and 19 May 2019. The advantage of using the streaming API for data collection is its low-latency access to Twitter’s global stream of tweet data.
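As a rough illustration of this collection step, the sketch below uses the rtweet package in R; the package choice and the parameter values are our own assumptions, since the chapter states only that Twitter’s streaming API was used with the Table 15.2 keywords and hashtags.

```r
# Illustrative sketch only: rtweet is assumed, and a valid Twitter API token
# must already be configured for the session.
library(rtweet)

# Keywords and hashtags from Table 15.2, combined into one comma-separated filter.
track <- paste(
  "NotreDame", "Notre-Dame", "NotreDameDeParis", "NotreDameFire",
  "Notre Dame", "NotreDameCathedralFire",
  "#NotreDameDeParis", "#NotreDameFire",
  sep = ","
)

# Keep the streaming connection open for one hour and return parsed tweets;
# in practice the stream would be re-opened for each of the collection days.
tweets <- stream_tweets(q = track, timeout = 60 * 60, parse = TRUE)
```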
Only users whose tweets contained the selected keywords or hashtags were included in the dataset. The original dataset contained 88 variables of user information, but only a few of them were used in the analysis. Our dataset comprises tweets by 52 065 unique users located in different countries. All tweets were translated into English before analysis. Of the 95 303 tweets, 33 633 were original tweets and 61 670 were retweets, with 10 084 different hashtags used overall.

4.3 Empirical Methodology
In order to understand how tweets’ features influence information timeliness during a fire disaster, we rely on a multidimensional linear regression model. The choice of this reduced-form model is natural, since our primary interest is in the correlations between information timeliness and tweets’ features. The following equation briefly describes our model:

RetweetTime_i = \beta_0 + \beta_1 TimeBand_i + \beta_2 Followers_i + \beta_3 Friends_i + \beta_4 Tweet_i + \beta_5 Retweet_i + \beta_6 Word_i + \beta_7 URL_i + \beta_8 Hashtag_i + \beta_9 HashtagPR_i + \varepsilon_i

where the variables are described in Table 15.3 and \varepsilon_i is an error term such that \varepsilon_i \sim N(0, \sigma^2).
Table 15.3    Variable descriptions and summary statistics

RetweetTime_i: the retweet time in minutes between tweet i and its retweets (mean 1.83e+05; S.D. 2.56e+06; range 4.00–2.33e+08).
TimeBand_i: the hour of the day at which tweet i is posted (mean 14.04; S.D. 6.52; range 0.00–23.59).
Followers_i: the number of Twitter followers of the twitterer who posts tweet i (mean 2.09e+04; S.D. 3.77e+05; range 0.00–4.19e+07).
Friends_i: the number of Twitter friends of the twitterer who posts tweet i (mean 1479.03; S.D. 8390.12; range 0.00–1.06e+06).
Tweet_i: the number of tweets posted by the twitterer who posts tweet i (mean 4.02e+04; S.D. 1.75e+05; range 1.00–6.67e+06).
Retweet_i: the total number of retweets of tweet i (mean 2141.13; S.D. 1.37e+04; range 0.00–2.55e+05).
Word_i: the number of words in tweet i (mean 133.65; S.D. 47.80; range 0.00–303.00).
URL_i: the number of URLs in tweet i (mean 0.86; S.D. 0.62; range 0.00–11.00).
Hashtag_i: the number of hashtags in tweet i (mean 0.67; S.D. 1.52; range 0.00–31.00).
HashtagPR_i: the average importance of hashtags in tweet i, as measured by the PageRank algorithm (mean 0.01; S.D. 0.01; range 0.00–0.11).
4.4 Natural Language Processing and Social Network Analysis
Depending on the type of disaster communication (crisis or risk), different features of Twitter can be used during the information sharing process. To understand how the type of disaster
communication impacts the Twitter features used, tweets need to be grouped by topics. In the existing literature, possible topics usually include emergency, damage, recovery, news updates and relief, and they can be grouped into three phases: preparedness, response and recovery. Because of the man-made nature of the fire disaster under study, we group tweets into only two groups: response and recovery. To achieve this, a topic modelling experiment is carried out with the statistical model known as Latent Dirichlet Allocation (LDA), a way of automatically discovering the topics that a tweet contains. The intuition behind LDA is to posit a fixed number of topics and, given that number, to uncover the topic distribution of each tweet. In other words, each tweet is described by a distribution over topics and each topic is described by a distribution over words. The aim of LDA is to map all the tweets to topics such that the words of each tweet are mostly captured by those topics.

After the topic modelling step, a contrast-coded variable (herein denoted ComTopics) is created to examine the mean difference in retweet time between response communication and recovery communication. Contrast coding re-centres a categorical variable so that the model intercept represents the mean of all data points in the dataset rather than the mean of a single category level. More specifically, we rely on this contrast-coded variable to study the mean difference in retweet time between crisis communication (response) and risk communication (recovery).

Moreover, we measured the importance of tweet-embedded hashtags (HashtagPR) using the raw data collected. Social network analysis of all identified hashtags was used to build this variable, as sketched below. In the network of hashtags, each node is a hashtag, and an edge is constructed between two hashtags appearing in the same tweet. A node’s importance can be measured by its degree centrality, which counts its links to other nodes in the network, or by its eigenvector centrality, which also takes into account the importance of its neighbours. However, those measures suffer from a significant drawback: they cannot distinguish between the quantity and the quality of the connected hashtags (Son et al., 2019). For this reason, we used PageRank, another measure of node importance, which outperforms degree and eigenvector centrality in representing node importance and is widely used by researchers. Its intuition is as follows: a score is assigned to each node depending on its number of incoming edges, and these edges are weighted by the relative score of their originating nodes. In the end, nodes with many incoming edges are influential, and the nodes to which they are connected share a part of that influence. After computing the importance of each node (in our case, each hashtag), the hashtag importance of a tweet is simply the average importance of the hashtags it contains. This measure is stored in the variable HashtagPR.
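A minimal sketch of the HashtagPR construction with the igraph package in R follows; the package, the undirected treatment of co-occurrence edges and the object hashtag_list (one character vector of hashtags per tweet) are our assumptions, not the authors’ exact implementation.

```r
library(igraph)

# hashtag_list: one character vector of hashtags per tweet, assumed to have been
# extracted beforehand (for example with a regular expression on the tweet text).

# Edge list: one edge for every pair of hashtags co-occurring in the same tweet.
edges <- do.call(rbind, lapply(hashtag_list, function(tags) {
  tags <- unique(tags)
  if (length(tags) < 2) return(NULL)
  t(combn(tags, 2))
}))

# Hashtag co-occurrence network (undirected here; a directed variant is possible).
g <- graph_from_edgelist(edges, directed = FALSE)

# PageRank score of each hashtag, i.e. its importance in the network.
pr <- page_rank(g)$vector

# HashtagPR for a tweet = average PageRank of the hashtags it contains
# (0 when the tweet has no hashtag appearing in the network).
hashtag_pr <- vapply(hashtag_list, function(tags) {
  scores <- pr[intersect(tags, names(pr))]
  if (length(scores) == 0) 0 else mean(scores)
}, numeric(1))
```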
With the integration of these newly constructed variables, our model is slightly modified and is described as follows:

RetweetTime_i = \beta_0 + \beta_1 TimeBand_i + \beta_2 Followers_i + \beta_3 Friends_i + \beta_4 Tweet_i + \beta_5 Retweet_i + \beta_6 Word_i + \beta_7 URL_i + \beta_8 Hashtag_i + \beta_9 HashtagPR_i + \beta_{10} ComTopics_i + \beta_{11} (ComTopics_i \times Tweet_i) + \beta_{12} (ComTopics_i \times Word_i) + \beta_{13} (ComTopics_i \times URL_i) + \beta_{14} (ComTopics_i \times HashtagPR_i) + \varepsilon_i

where \varepsilon_i is an error term such that \varepsilon_i \sim N(0, \sigma^2).
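A minimal sketch of how this specification can be estimated in R is given below; the data frame tweets_df, its column names and the 0/1 coding of the phase indicator (named com_topicsRecovery, following the naming used in the results section) are illustrative assumptions, since the chapter does not show its estimation code.

```r
# tweets_df is assumed to hold one row per tweet with the variables of Table 15.3
# plus a phase label ("response" or "recovery") obtained from the LDA step.
tweets_df$com_topicsRecovery <- as.integer(tweets_df$phase == "recovery")

fit <- lm(
  RetweetTime ~ TimeBand + Followers + Friends + Tweet + Retweet +
    Word + URL + Hashtag + HashtagPR + com_topicsRecovery +
    com_topicsRecovery:Tweet + com_topicsRecovery:Word +
    com_topicsRecovery:URL + com_topicsRecovery:HashtagPR,
  data = tweets_df
)

# Reports estimates, standard errors, t values and p-values in the same layout
# as the regression table presented later in the chapter.
summary(fit)
```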
5. DATA EXPLORATION: INFORMATION TRANSMISSION PROCESS DURING A FIRE DISASTER
5.1 Twitter-based Activity During the Fire Emergency: from 15 April 2019, 18:20 to 16 April 2019, 09:20
The above period corresponds to the day of the outbreak of the fire, during which firefighters were still working to extinguish the flames. It corresponds to the response phase of the disaster according to the three common phases of a disaster (Sheppard et al., 2012). During the response day of the Notre-Dame fire disaster, the four most prominent words used in tweets were “notredame”, “fire”, “paris” and “cathedral”. This finding is in line with the topic of the event being covered. The word frequency histogram also revealed that people were very “sad” about the huge artistic losses caused by the gigantic flames that consumed the cathedral. In fact, around one thousand years of world art history was burned. This is why the terms “history”, “world”, “flames” and “art” frequently emerged from the analysis of tweets. An important proportion of tweets that were posted and shared appeared as a tribute to the firefighters for their professionalism and bravery. One of the firefighters’ most appreciated actions was that they saved the main structure of the cathedral and protected most of the abundant valuable artistic treasures that this monument contained. Their work was praised on Twitter through the use of words like “firefighters”, “saved”, “part” and “structure”, all of which feature among the most frequent terms.
Figure 15.2
Most frequent words used during the day of the fire
The bar plot in Figure 15.2 is limited to the first 19 most frequently used words. To obtain a better graphical representation of these words and their frequency, we relied on a word cloud (Figure 15.3). Regarding the word content of the tweets analysed for the response phase, it was found that the main topics of interest during this phase could be grouped, in order of importance, into the following four categories:
● presentation of the context: this implies the use of words describing the cathedral, its geographical situation, its history, its importance and its current state in flames;
● emotion towards the fire: the most common words here describe the affection for the cathedral, and the sadness about the situation;
● firefighters: specific words used for those who combated the flames;
● political speech relating to the fire disaster: the key reference here is the interventions of the French and American Presidents in the media.
Figure 15.3
Word cloud during the response phase
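A minimal sketch of the word-frequency computation behind Figures 15.2 and 15.3, assuming a data frame tweets with the tweet text in a text column; the tidytext and wordcloud packages and the cleaning steps shown are our assumptions rather than the authors’ exact pipeline.

```r
library(dplyr)
library(tidytext)
library(wordcloud)

word_counts <- tweets %>%
  mutate(text = gsub("https?://\\S+", "", text)) %>%  # drop URLs before tokenizing
  unnest_tokens(word, text) %>%                        # one row per word
  anti_join(stop_words, by = "word") %>%               # remove common stop words
  count(word, sort = TRUE)

# Horizontal bar plot of the most frequent words (Figure 15.2 shows the top 19).
top_words <- head(word_counts, 19)
barplot(rev(top_words$n), names.arg = rev(top_words$word), horiz = TRUE, las = 1)

# Word cloud of the same frequencies (Figure 15.3).
wordcloud(words = word_counts$word, freq = word_counts$n, max.words = 100)
```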
The available literature on disasters refers to the response phase as the disaster management step in which preparative plans (notably those developed during the preparedness phase) are translated into actions, and critical information is shared with the people directly or indirectly involved (crisis communication). Our analysis of the topics of interest for this phase did not enable us to identify these communication activities for our disaster case. This can be explained in three ways: (1) the fire disaster of Notre-Dame de Paris is a man-made disaster without a preparedness phase; (2) only the firefighters were directly concerned by the actions
geared towards extinguishing the flames; and (3) it was not a life-threatening disaster for the population. So, tweets in this phase certainly came mostly from viewers and commentators.

Regarding the frequency of tweets with respect to the time frame, we can observe a sharp decrease around midnight and just after (Figure 15.4). A rationale behind this could be that, by midnight in France, a large part of the population is sleeping. It may also be assumed that after the peak of the emotion caused by this fire, all the population could do at that time was simply to wait for the final extinguishing of the flames and reconstruction solutions.
Source:
Data collected from Twitter’s REST API via retweet.
Figure 15.4
Tweet counts aggregated at 1-minute frequency from 15 April 2019, 18:20 to 16 April 2019, 09:20
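The 1-minute aggregation in Figure 15.4 can be reproduced along the following lines, assuming a POSIXct column created_at in the tweets data frame; the dplyr/lubridate/ggplot2 route is one possible choice, not necessarily the one used by the authors.

```r
library(dplyr)
library(lubridate)
library(ggplot2)

per_minute <- tweets %>%
  filter(created_at >= as.POSIXct("2019-04-15 18:20", tz = "CET"),
         created_at <  as.POSIXct("2019-04-16 09:20", tz = "CET")) %>%
  mutate(minute = floor_date(created_at, unit = "minute")) %>%  # 1-minute bins
  count(minute)

ggplot(per_minute, aes(x = minute, y = n)) +
  geom_line() +
  labs(x = "Time", y = "Tweets per minute")
```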
Beyond its usefulness in the computation of the importance of hashtags and in the construction of the HashtagPR variable, the hashtags network map is a major tool for understanding the process of information transmission through Twitter during a fire disaster (see Figure 15.5). It informs us not only about various kinds of concerns raised by Twitter users from all over the world, but also about the role of central hashtags as well as the order of co-occurrence of hashtags in a given tweet. From tweets extracted during the day of the fire using the keyword “NotreDamedeParis”, it appears that the major subject is the fire incident that engulfed this ancient cathedral: this is justified by the central position in the hashtag network map of “#NOTREDAMEFIRE”. Furthermore, the map indicates that users commented broadly on the US President Donald Trump’s proposal to use Canadair to extinguish the flames, which is illustrated by a directional link between the hashtags “#NOTREDAMEFIRE” and “#TRUMP”. The hashtag “#MACRON” also holds a central position on the network map. The use of the hashtags “#FIRE” or “#FRANCE” is very likely to refer to “#MACRON”, a hashtag with many incoming edges, which is evidence that the intervention of the incumbent President of the Republic in relation to the disaster was quite well commented on by people. The presence of “#ART” and
“#CULTURE” on the network map, which is indirectly related to “#NOTREDAMEFIRE”, reveals that the population were sad and scared to witness the disappearance of a major heritage of modern art and culture.
Figure 15.5
Hashtag network during the response phase
Source: The authors, using the original S code by Richard A. Becker, Allan R. Wilks. R version by Ray Brownrigg. Enhancements by Thomas P Minka and Alex Deckmyn (2018). Maps: Draw Geographical Maps. R package version 3.3.0. https://CRAN.R-project.org/package=maps.
Figure 15.6
Tweets’ locations during the response phase
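Given geolocated tweets, a map such as Figure 15.6 can be drawn with the maps package cited in the source note above; the coordinate column names lng and lat are assumptions.

```r
library(maps)

# World map background, with one point per geolocated tweet overlaid.
map("world", fill = TRUE, col = "grey90")
points(tweets$lng, tweets$lat, pch = 20, cex = 0.4, col = "red")
```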
The Notre-Dame de Paris cathedral was extremely famous (13 million visitors per year), and the news about this building being on fire provoked sadness around the world. The tweets collected actually show that reactions originated from many locations globally, but mainly from Europe, America and Asia (Figure 15.6). Twitterers posted mainly in English, French, Spanish and Italian (Figure 15.7). Strangely, the map in Figure 15.6 shows that very few tweets came from Africa, Russia and the Middle East.
Figure 15.7    Languages used during the response phase

5.2    Twitter-based Activity Days after the Fire: 16–19 April 2019
This period corresponds to the first four days that followed the day of the fire. The first eight hours or so of 16 April still belong to the response phase, because the fire had not yet been extinguished by that time; the rest of the period corresponds, according to the three common phases of a disaster, to the end of the response phase and the beginning of the recovery phase.
Figure 15.8
Most frequent words during first days after the fire
As in the previous phase, the words “notredame”, “fire”, “paris” and “cathedral” remained the most frequently used in tweets. They naturally present the context of tweets. In terms of frequency, these contextual words were followed by words related to the reconstruction (“reconstruction”, “rebuilt”, “million”), and to a lesser extent, by words expressing emotion (“heart”), and those regarding the French President’s post (“macron”). The last category of words (“thank”) illustrated the appreciation of firefighters’ interventions (“work”) (see Figure 15.8). Figure 15.9 shows the next word cloud, which provides an in-depth, broad analysis of words used in this phase. The most frequent terms illustrate the focus of tweets on the three selected categories of the days after the fire:
● Terms related to the presentation of the context: there are a few words describing the cathedral and its geographical situation (“notre dame”, “cathedral”, “paris”, “heritage”, “history”, “years”, and so forth).
● Terms related to the reconstruction of the cathedral: they are related to reconstruction and rebuilding, but there are also many words about donations and other forms of assistance for reconstruction (“reconstruction”, “rebuilt”, “donations”, “money”, “million”, etc.).
● Words about political figures: such words are in relation to the postings of French President Macron and of his US counterpart (“president”, “macron”, “trump”).

The hashtag network in Figure 15.10 actually shows a relationship with the media outings of the French President and with another hot topic of the same period (e.g. the “gilets jaunes” crisis in France).
Figure 15.9
Term cloud for the first days after the fire
From the hashtag network, it can be seen that the most central hashtags, those with the most incoming links, are #NOTREDAMEFIRE and #FRANCE, with almost all the other hashtags related to them directly or indirectly. This is expected because they carry the central subject of our study. We can also notice the central position of the hashtags #MACRON and #TRUMP, both with many incoming links, which may indicate a relationship between the majority of tweets posted at that particular period and Macron’s and Trump’s declarations about the fire, the reconstruction, and other peripheral topics such as the “gilets jaunes” (#GILETSJAUNE) crisis in France and Holy Week (#SEMANASANTA). Overall, it can clearly be seen that the emotion observed in the previous (response) phase gave way to reflection on the reconstruction in this phase, even if we still find peripheral tweets on political speech and on new developments concerning the fire incident.
Figure 15.10    Hashtag network during first days after the fire

In the literature, the recovery phase is generally defined as the step in which actions are taken to recover from the disaster and to prevent future disasters (risk communication). The topics of interest identified are adequately aligned with the communication actions expected in the recovery phase. It should be noted that active twitterers from among the general population are, in our case, mostly spectators and commentators on the recovery phase. As in the previous period (response phase), the map (Figure 15.11) clearly indicates that reactions concerning the fire outbreak continued to flow in from around the world during the recovery phase. In this period, tweets came mainly from Europe, America and Australia. In America, people from the USA and South American countries were the main posters on Twitter in this regard. Once again, African residents did not make a single post, whereas Asia made just a few. The next bar plot of languages (Figure 15.12) shows that the most used languages in postings were French, English, Spanish, German, Italian and Turkish. The French language possibly topped the chart for obvious reasons, one of them being that Notre-Dame de Paris is a French monument and that the French population is the first to be concerned by reconstruction works
– and this is related to the recovery phase. However, more than 30 different languages were used in postings, thus demonstrating the global impact of the event. These languages include almost all European languages, plus those spoken in South and North American countries. Two Asian languages (Japanese and Korean) were also used at this stage, even though they do not feature in the above-mentioned country location map of tweets.
Source: The authors, using the original S code by Richard A. Becker, Allan R. Wilks. R version by Ray Brownrigg. Enhancements by Thomas P Minka and Alex Deckmyn (2018). Maps: Draw Geographical Maps. R package version 3.3.0. https://CRAN.R-project.org/package=maps.
Figure 15.11 Tweets’ locations during the first days after the fire
Figure 15.12 Languages used in tweets during the first days after the fire
Regarding the frequency of tweets with respect to the time frame in Figure 15.13, we can see a pattern for almost every analysed day that shows an increase in the frequency of tweets throughout the day and a sharp decrease around midnight and just after. For the first three days in the figure, we can observe some peaks in the early or late evening. An explanation for this may be that most of the tweets came from Europe and that people tend to react more throughout the day, and even more in the evenings, after work.
Source:
Data collected from Twitter’s REST API via retweet
Figure 15.13    Tweet counts aggregated at 1-minute frequency from 16 April 2019, 09:20 to 20 April 2019, 00:00

5.3    Twitter Activity Three Weeks After the Fire
Three weeks after the flames were extinguished, we are still in the recovery phase. The disaster is still present in people’s minds and the government has been thinking about reconstruction programmes. How to finance the reconstruction works remains a key issue, and campaigns appealing for donations have been launched across the country and internationally. As in the previous periods, the words “notredame”, “fire”, “paris” and “cathedral” were the words used most frequently by twitterers (Figure 15.14). They naturally present the context of tweets. They were followed by terms such as “rebuild”, “rebuilt” and “million”, all related to reconstruction.
Figure 15.14 Most frequently used words three weeks after the fire disaster
The word cloud in Figure 15.15 provides an in-depth and holistic analysis of words used in this phase.
Figure 15.15    Word cloud three weeks after the fire disaster

According to the content of the tweets analysed in this word cloud, tweets’ words can be grouped into three categories:
● presentation of the context: there are a few words describing the cathedral and its geographical situation (“notre dame”, “paris”, “cathedral”, “fire”, etc.);
● reconstruction of the cathedral: the words used in tweets are about reconstruction (“reconstruction”, “rebuild”, “restoration”, “million”). Compared to the previous period (four days after the fire), words of this category are much more prominent.
● peripheral subjects: we have a lot of words that can refer to many other peripheral subjects. Such peripheral subjects are present in larger numbers compared to the previous period. They are related, inter alia, to news, politics, culture and tourism.

The hashtag network in Figure 15.16 reveals that the most central hashtags with the most incoming links (almost all of which are directly or indirectly linked to other hashtags) are #TRAVEL, #FRANCE, #NOTREDAMEFIRE, #PATRIMOINE and #TOURISM. Contrary to the previous period (beginning of the recovery phase), this phase is characterized by the appearance of very popular hashtags, as well as other tags related to the context of the study (#NOTREDAMEFIRE or #FRANCE), but also to travel and tourism. We can deduce, for example, that the subjects developed in this period (three weeks after the fire) are more about the consequences of the fire, especially on tourism and travelling, art, culture and the “gilets jaunes” crisis, etc.
Figure 15.16    Hashtag network three weeks after the fire disaster

The reconstruction subject, which was already mentioned in the previous period, is emphasized in this period (word cloud) and linked to more and more peripheral subjects (word cloud and hashtag network). There is an increase in the number of words concerning the reconstruction of the cathedral, as well as on multiple peripheral subjects that can be related to risk communication. The map in Figure 15.17 shows that, as in the previous period (beginning of the recovery phase), the fire still provokes reactions from around the world. In this period, tweets come mainly from Europe, America and Australia. Residents of the United States and South American countries are the main posters on the American continent. Unlike the previous period, we can observe the appearance of tweets from the Asian and African continents. Thus, there is also a geographical expansion of the sources of tweets, together with an expansion of topics in this period.

Source: The authors, using the original S code by Richard A. Becker, Allan R. Wilks. R version by Ray Brownrigg. Enhancements by Thomas P. Minka and Alex Deckmyn (2018). Maps: Draw Geographical Maps. R package version 3.3.0. https://CRAN.R-project.org/package=maps.

Figure 15.17    Tweets’ locations three weeks after the fire disaster

The bar plot of languages used presents almost the same languages that were used in tweets in the previous period (Figure 15.18). The most used languages are still French, English, Spanish, German, Italian and Turkish, though more than 30 different languages are identified, thereby demonstrating the universal breadth of the event concerned. We have almost all European languages, those of South and North American countries, and some Asian languages. Table 15.4 is more explicit about the language codes used in the tweets’ language plot.
Figure 15.18 Tweets’ languages three weeks after the fire disaster
Source:
Data collected from Twitter’s REST API via retweet.
Figure 15.19    Tweet counts aggregated at 1-hour frequency three weeks after the fire disaster

In terms of tweet frequency, almost every analysed day brings out a pattern illustrating an increase in the frequency of tweets throughout the day and a sharp decrease around midnight and just after (Figure 15.19). We noticed the same behaviour in the previous period.
Table 15.4    Language codes

Arabic (ar); Armenian (hy); Basque (eu); Bulgarian (bg); Catalan (ca); Chinese (zh); Croatian (hr); Czech (cs); Danish (da); Dutch (nl); English (en); Estonian (et); Farsi (fa); Finnish (fi); French (fr); Georgian (ka); German (de); Greek (el); Haitian (ht); Hebrew (iw); Hungarian (hu); Indonesian (in); Italian (it); Japanese (ja); Korean (ko); Latvian (lv); Lithuanian (lt); Norwegian (no); Polish (pl); Portuguese (pt); Romanian (ro); Russian (ru); Serbian (sr); Slovenian (sl); Spanish (es); Swedish (sv); Tagalog (tl); Thai (th); Turkish (tr); Ukrainian (uk); Undefined (und); Vietnamese (vi)
6. RESULTS AND DISCUSSION
The aim of this section is to understand how tweets’ features influence information timeliness during a fire disaster. We rely on a multidimensional linear regression model to attain this objective. Before running the regression model, some important variables are first constructed.

6.1    Topic Modelling
To understand how the type of disaster communication impacts the use of Twitter’s features, tweets need to be grouped by topics. By applying Latent Dirichlet Allocation, we obtained a set of topics, each described by its most important words, and, for each tweet, a topic distribution (the probability that the tweet belongs to each topic). Figure 15.20 shows the two-topic model as it appeared for both the response phase (left) and the recovery phase (right).
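As an illustration, this step can be sketched in R with the tm and topicmodels packages; the package choice, the preprocessing shown and the seed value are ours, not necessarily the authors’ exact settings.

```r
library(tm)
library(topicmodels)

# Document-term matrix built from the (already cleaned and translated) tweet texts.
corpus <- VCorpus(VectorSource(tweets$text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("en"))
dtm <- DocumentTermMatrix(corpus)
dtm <- dtm[slam::row_sums(dtm) > 0, ]   # drop tweets left empty after cleaning

# Two-topic LDA model, one topic per expected communication type.
lda_fit <- LDA(dtm, k = 2, control = list(seed = 1234))

terms(lda_fit, 10)                  # ten most probable words per topic (cf. Figure 15.20)
topic_of_tweet <- topics(lda_fit)   # most likely topic of each retained tweet
gamma <- posterior(lda_fit)$topics  # per-tweet topic probabilities
```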
Figure 15.20    A double-topic model

In the response phase (topic 1), the descriptive words are related to the context of the event (“fire”, “flames”, “burned”, “church”, “building”), to the importance of the cathedral (“heritage”, “world”, “history”, “unesco”, “art”, “oldest”, “great”), to the location (“europe”, “site”, “located”) and to emergency cases (“help”). Other words characterizing this topic are related to specific parts of the cathedral that people were worried about (“roof”, “spire”, “hunchback”, “part”), to the engendered emotion (“heart”, “suffered”, “beauty”, “beautiful”) and to the work of firefighters (“firefighters”, “work”).

Topic 2 has to do with the recovery phase and describes the situation with words strongly related to the reconstruction (“reconstruction”, “million”, “rebuilt”, “rebuild”, “restoration”,
“donations”, “scientists”), the news (“news”), to the humans (“people”, “human”, “species”), to the monument’s design (“design”, “architecture”) and to the location (“parisians”). Unlike the response phase, we can observe a wide diversity of other words that could be linked to several other issues, including crisis (certainly the “gilets jaunes” crisis), macron (that is, the statement of the French President), proposals, environmental matters, impacts, endangered species, vision, urban status, law, films, networks, and so on.

Another observation is that the topic modelling is aligned with the results obtained upon analysis of words and hashtag networks for the response and recovery phases. While the response phase is mostly related to the context of the event, the importance of the cathedral, the emotions and the emergency cases identified, the recovery phase rather deals with the reconstruction and, more importantly, with a number of peripheral issues, as illustrated by the words used in tweets.

6.2    Regression Results
This subsection presents the results of the multidimensional linear regression model used to understand how tweets’ features influence information timeliness during a fire disaster. To examine the mean difference in retweet time between the response communication and the recovery phase, the contrast code variable com_topicsRecovery created during the topic modelling step is introduced. This variable takes the value 1 when the tweet belongs to the recovery phase and 0 elsewhere.

In RH1, we hypothesized that the number of words in a tweet is positively associated with retweet time. With the positive estimate (0.840) for the Word variable and a p-value of essentially zero, we can clearly state that this hypothesis is strongly supported. This result is consistent with the existing literature. Looking at the coefficient of the interaction term Word × com_topicsRecovery, it appears that during the recovery phase the effect of the number of words of a tweet is bigger: this coefficient increases significantly, by 0.204, relative to the response phase. Thus, the presence of more words in tweets slows down the spread of information via retweets, and this effect is more pronounced during the recovery phase.
Table 15.5    Regression results

Variable                              Estimate    Std. Error    t value    Pr(>|t|)
Intercept                              6.583       0.360         18.27      0.000   ***
TimeBand                              −0.321       0.009        −33.80      0.000   ***
Followers                             −0.119       0.024        −4.915      0.000   ***
Friends                                0.111       0.021         5.189      0.000   ***
Retweet                                0.307       0.008         40.22      0.000   ***
Tweet                                 −0.190       0.008        −24.86      0.000   ***
Word                                   0.840       0.072         11.59      0.000   ***
URL                                   −0.145       0.014        −10.15      0.000   ***
Hashtag                                0.013       0.013         0.986      0.323
HashtagPR                             −0.275       0.013        −21.77      0.000   ***
com_topicsRecovery                    −0.561       0.553        −1.015      0.310
Tweet × com_topicsRecovery             0.021       0.010         1.972      0.007   ***
Word × com_topicsRecovery              0.204       0.112         1.830      0.049   *
URL × com_topicsRecovery              −0.218       0.020        −11.06      0.067   .
HashtagPR × com_topicsRecovery         0.111       0.021         5.290      0.000   ***

Signif. codes: 0.1% ‘***’; 1% ‘**’; 5% ‘*’; 10% ‘.’. Residual standard error: 2.302 on 61082 degrees of freedom. R2: 0.115; Adj. R2: 0.1148; F-statistic: 567.2 on 14 and 61082 DF; p-value: < 2.2e-16.
For RH2, we hypothesized that, during a fire disaster, the number of URLs in a tweet is negatively associated with retweet time. With the negative estimate (−0.145) for the URL variable and a p-value of essentially zero, we can clearly state that we found support for this hypothesis. More precisely, an increase in the number of URLs by one unit reduces the retweet time by 14.5 per cent. Restricting ourselves to the recovery phase (by looking at the coefficient of the variable URL × com_topicsRecovery, which is around −0.218), it appears that the impact of the number of URLs in a tweet is more important during the recovery phase of a fire disaster. These coefficients lead to the following interpretation: a one-unit increase in the number of URLs of a tweet during the recovery phase of a fire disaster can reduce the retweet time by about 35 per cent. Thus, the presence of more URLs in tweets speeds up the spread of information via retweets, with a more important effect during the recovery phase.

In RH3, we wanted to check whether the number of hashtags in a tweet is negatively associated with retweet time. This assumption did not seem to be verified: we obtained a non-significant positive estimate (0.013) for the Hashtag variable. However, it emerged that it is the importance of the hashtags of a tweet (measured by the variable HashtagPR) which has a significant negative impact on the retweet time. In other words, the greater the importance of hashtags in a tweet, the smaller will be the average time between that tweet and its retweets. By examining the mean difference in retweet time between the response and the recovery phases due to hashtag importance (the coefficient of the variable HashtagPR × com_topicsRecovery), we found that during the recovery phase the effect of the hashtag importance variable is reduced. Hypothesis RH4, which stated that the importance of the hashtags used in a tweet is negatively associated with retweet time, is thus verified.

Beyond the main hypotheses of interest, Table 15.5 also reveals the following links between user features and information timeliness: (1) the number of followers and the number of tweets by an active twitterer reduce the average retweet time; (2) the total number of retweets of a tweet and the number of Twitter friends of an active twitterer increase the average retweet time.

6.3    Implications

Man-made disaster situations present a challenge for emergency services: people who provide information face physical danger, and the demand for active management of information is high. Over the last decade, the dissemination channel of emergency information has drastically changed: people are moving from traditional unidirectional channels (e.g. radio, TV) to multidirectional ones based on microblogs (e.g. Twitter). For this reason, emergency institutions need an effective and efficient presence on social networks to rapidly deliver accurate and useful information. Their social media activities should consist of collecting and processing texts, images, video and other data in order to spread information likely to support their rescue activities; hearing from the public; acquiring disaster situational awareness information; providing searchable information and notifications; providing localized disaster
situational information; detecting or predicting critical events. To be effective and efficient, emergency agencies should find strategies to collect, process and disseminate useful disaster information faster. Our study provides interesting contributions towards attaining this objective using Twitter as a microblog. First, when collecting or disseminating information concerning a specific fire disaster, emergency services should use high-quality topical keywords, namely important hashtags. A higher number of important hashtags will lead to increased information timeliness or information extraction. Second, when sharing information during a fire disaster, stakeholders should reduce the number of words used in their tweets. They should favour the use of URLs. This permits a reduction of reading time and helps to redirect followers to external web pages where images or videos are directly available. Third, since information timeliness is impacted by disaster phases, the communication strategy should be adapted accordingly. During the response phase, emergency services should spread short, precise and concise information. More URLs could be used during the recovery phase.

Apart from the content features of tweets, some user characteristics can be considered in order to improve the dissemination of information during a fire disaster. Emergency agencies could work closely with users who have many followers or a high tweet activity. The latter could be solicited to relay critical information. This is supported by our results, which revealed that the more followers or tweets a user has, the higher the diffusion speed of their tweets.
7. CONCLUSIONS AND PERSPECTIVES
In this chapter we studied the usage of Twitter during a disaster. Unlike the majority of current works on information analysis on Twitter during disasters such as tsunamis, floods, earthquakes, hurricanes or terrorist attacks, we focused here on a fire disaster event, namely the one that affected the Notre-Dame de Paris cathedral in April 2019. Given the nature, importance and global impact of this disaster, the study of communication patterns on popular social media such as Twitter could be relevant for organizations or people who wish to communicate effectively during this type of event. More specifically, the purpose of this chapter was twofold. First, we wanted to understand information content and differences in the information disseminated between the phases of this disaster (response and recovery phases). Second, we were interested in understanding how tweets’ features (number of hashtags, importance of hashtags, number of words, number of URLs, time band, etc.) can influence information timeliness during a fire disaster.

In our content analysis, we found that topics during the response phase were different from topics in the recovery phase. In the response phase, the topics developed were those aligned with crisis communication in disaster management. Therefore, the information shared at this stage included the disaster context, the history and the importance of the event, the worldwide engendered emotions, some emergency topics related to firefighters’ work or important parts of the cathedral, and, marginally, allusions to political discourse. From the following days (beginning of the recovery phase) up to the next three weeks, these topics gave way to others concerned with the reconstruction of the monument, future impacts of the disaster, and
multiple peripheral topics. These topics could be related to risk communication in disaster management.

In order to understand how tweets’ features influence information timeliness, we relied on a multidimensional linear regression model with the average retweet time as the target variable. First, we found that the presence of more words in tweets slows down the spread of information via retweets and that this effect is more pronounced during the recovery phase. Second, the presence of more URLs in tweets speeds up the spread of information via retweets, with a more important effect during the recovery phase. Third, the number of hashtags in a tweet is not significantly negatively associated with retweet time. However, the extent to which the hashtags used in a tweet are important determines the average time between that tweet and its retweets. Also, during the recovery phase, the effect of the “hashtag importance” variable is reduced. Fourth, the number of followers and the number of tweets by an active twitterer reduce the average retweet time; the total number of retweets of a tweet and the number of Twitter friends of that active twitterer increase the average retweet time.

These results give us some essential elements of communication on Twitter during a human-made disaster, particularly with regard to the fire disaster that affected the Notre-Dame de Paris cathedral in 2019. It should be noted that communication on this type of disaster through the Twitter platform is yet to be fully and extensively studied. Therefore, studies like this one, no matter the specific human-made disaster concerned, are needed to gather more data that may help improve the robustness of our results or serve for comparison purposes. The content analysis conducted showed a great amount of emotional data conveyed through tweets. A sentiment analysis could be carried out to deepen the understanding of such emotional content and related impacts, and thus characterize the polarity of emotions expressed during a man-made disaster (He et al., 2015; Stieglitz and Dang-Xuan, 2013). The theoretical framework for such an analysis could also draw on existing hypotheses or results in fields such as psychology and other social sciences (Son et al., 2019). Furthermore, other future works could emphasize the analysis of Twitter users’ profiles (people, communities or organizations), the relations between users (social network analysis), tweets’ contents, and their impact on disaster management information dissemination on a social media platform like Twitter.
REFERENCES Abedin, Babak and Abdul Babar. 2018. “Institutional vs. non-institutional use of social media during emergency response: A case of Twitter in 2014 Australian bush fire.” Information Systems Frontiers 20 (4): 729–40. Altay, Nezih and Raktim Pal. 2014. “Information diffusion among agents: Implications for humanitarian operations.” Production and Operations Management 23 (6): 1015–27. Anson, Susan, Hayley Watson, Kush Wadhwa and Karin Metz. 2017. “Analysing social media data for disaster preparedness: Understanding the opportunities and barriers faced by humanitarian actors.” International Journal of Disaster Risk Reduction 21: 131–9. Bagloee, Saeed Asadi, Karl H. Johansson and Mohsen Asadi. 2019. “A hybrid machine-learning and optimization method for contraflow design in post-disaster cases and traffic management scenarios.” Expert Systems With Applications 124: 67–81. Baxter, Graeme and Rita Marcella. 2017. “Voters’ online information behaviour and response to campaign content during the Scottish referendum on independence.” International Journal of Information Management 37 (6): 539–46.
264 Handbook of big data research methods Benton, Adrian, Braden Hancock, Glen Coppersmith, John W. Ayers and Mark Dredze. 2016. “After Sandy Hook Elementary: A year in the gun control debate on Twitter.” ArXiv Preprint ArXiv:1610.02060. Berger, Jonah and Katherine L. Milkman. 2012. “What makes online content viral?” Journal of Marketing Research 49 (2): 192–205. Carley, Kathleen M., Momin Malik, Peter M. Landwehr, Jürgen Pfeffer and Michael Kowalchuck. 2016. “Crowd sourcing disaster management: The complex nature of Twitter usage in Padang Indonesia.” Safety Science 90: 48–61. Chatfield, Akemi T., Samuel Fosso Wamba and Hirokazu Tatano. 2010. “E-government challenge in disaster evacuation response: The role of RFID technology in building safe and secure local communities.” In 2010 43rd Hawaii International Conference on System Sciences, 1–10. IEEE. Chen, Xin, Mihaela Vorvoreanu and Krishna Madhavan. 2014. “Mining social media data for understanding students’ learning experiences.” IEEE Transactions on Learning Technologies 7 (3): 246–59. Choi, Hyeok-Jun and Cheong Hee Park. 2019. “Emerging topic detection in twitter stream based on high utility pattern mining”. Expert Systems With Applications 115: 27–36. Cvetojevic, Sreten and Hartwig H. Hochmair. 2018. “Analyzing the spread of tweets in response to Paris attacks.” Computers, Environment and Urban Systems 71: 14–26. Dang, Qi, Feng Gao and Yadong Zhou. 2016. “Early detection method for emerging topics based on dynamic Bayesian networks in micro-blogging networks.” Expert Systems With Applications 57: 285–95. De Longueville, Bertrand, Robin S. Smith and Gianluca Luraschi. 2009. “OMG, from here, I can see the flames!: A use case of mining location based social networks to acquire spatio-temporal data on forest fires.” In Proceedings of the 2009 International Workshop on Location Based Social Networks, 73–80. ACM. Dusse, Flávio, Paulo Simões Júnior, Antonia Tamires Alves, Renato Novais, Vaninha Vieira and Manoel Mendonça. 2016. “Information visualization for emergency management: A systematic mapping study.” Expert Systems With Applications 45: 424–37. Elbanna, Amany, Deborah Bunker, Linda Levine and Anthony Sleigh. 2019. “Emergency management in the changing world of social media: Framing the research agenda with the stakeholders through engaged scholarship.” International Journal of Information Management 47: 112–20. Gruebner, Oliver, Martin Sykora, Sarah R. Lowe, Ketan Shankardass, Ludovic Trinquart, Tom Jackson, S.V. Subramanian et al. 2016. “Mental health surveillance after the terrorist attacks in Paris.” The Lancet 387 (10034): 2195–6. He, Wu, Harris Wu, Gongjun Yan, Vasudeva Akula and Jiancheng Shen. 2015. “A novel social media competitive analytics framework with sentiment benchmarks.” Information & Management 52 (7): 801–12. Imran, Muhammad, Carlos Castillo, Fernando Diaz and Sarah Vieweg. 2015. “Processing social media messages in mass emergency: A survey.” ACM Computing Surveys (CSUR) 47 (4): 67. Kaila, Rajesh Prabhakar. 2016. “An empirical text mining analysis of Fort McMurray wildfire disaster Twitter communication using topic model.” Disaster Advances 9: 1–6. Kim, Hyun Suk. 2015. “Attracting views and going viral: How message features and news-sharing channels affect health news diffusion.” Journal of Communication 65 (3): 512–34. Kim, Jooho and Makarand Hastak. 2018. “Social network analysis: Characteristics of online social networks after a disaster.” International Journal of Information Management 38 (1): 86–96. 
Kim, Jooho, Juhee Bae and Makarand Hastak. 2018. “Emergency information diffusion on online social media during Storm Cindy in US.” International Journal of Information Management 40: 153–65. Landwehr, Peter M., Wei Wei, Michael Kowalchuck and Kathleen M. Carley. 2016. “Using Tweets to support disaster planning, warning and response.” Safety Science 90: 33–47. Lu, Rong and Qing Yang. 2012. “Trend analysis of news topics on Twitter.” International Journal of Machine Learning and Computing 2 (3): 327. Ma, Tinghuai, YuWei Zhao, Honghao Zhou, Yuan Tian, Abdullah Al-Dhelaan and Mznah Al-Rodhaan. 2019. “Natural disaster topic extraction in Sina microblogging based on graph analysis.” Expert Systems With Applications 115: 346–55.
Notre-Dame de Paris cathedral is burning: let’s turn to Twitter 265 Martinez-Rojas, Maria, Maria del Carmen Pardo-Ferreira and Juan Carlos Rubio-Romero. 2018. “Twitter as a tool for the management and analysis of emergency situations: A systematic literature review.” International Journal of Information Management 43: 196–208. Martinez-Rojas, Maria, María del Carmen Pardo-Ferreira, Antonio López-Arquillos and Juan Carlos Rubio-Romero. 2019. “Using Twitter as a tool to foster social resilience in emergency situations: A case study.” In Engineering Digital Transformation, 243–45. Springer. Meng, Jingbo, Wei Peng, Pang-Ning Tan, Wuyu Liu, Ying Cheng and Arram Bae. 2018. “Diffusion size and structural virality: The effects of message and network features on spreading health information on Twitter.” Computers in Human Behavior 89: 111–20. Murray‐Tuite, Pamela, Y. Gurt Ge, Christopher Zobel, Roshanak Nateghi and Haizhong Wang. 2019. “Critical time, space, and decision‐making agent considerations in human‐centered interdisciplinary hurricane‐related research.” Risk Analysis. Murthy, Dhiraj and Alexander J. Gross. 2017. “Social media processes in disasters: Implications of emergent technology use.” Social Science Research 63: 356–70. Nagy, Ahmed and Jeannie Stamberger. 2012. “Crowd sentiment detection during disasters and crises.” In Proceedings of the 9th International ISCRAM Conference, 1–9. Nayebi, Maleknaz, Mahshid Marbouti, Rachel Quapp, Frank Maurer and Guenther Ruhe. 2017. “Crowdsourced exploration of mobile app features: A case study of the Fort McMurray wildfire.” In Proceedings of the 39th International Conference on Software Engineering: Software Engineering in Society Track, 57–66. IEEE Press. Oh, Onook, Manish Agrawal and H. Raghav Rao. 2011. “Information control and terrorism: Tracking the Mumbai terrorist attack through Twitter.” Information Systems Frontiers 13 (1): 33–43. Orellana-Rodriguez, Claudia and Mark T. Keane. 2018. “Attention to news and its dissemination on Twitter: A survey.” Computer Science Review 29: 74–94. Accessed at https://doi.org/10.1016/j.cosrev .2018.07.001. Pedraza‐Martinez, Alfonso J. and Luk N. Van Wassenhove. 2016. “Empirically grounded research in humanitarian operations management: The way forward.” Journal of Operations Management 45 (1): 1–10. Ragini, J. Rexiline, P.M. Rubesh Anand and Vidhyacharan Bhaskar. 2018. “Big data analytics for disaster response and recovery through sentiment analysis.” International Journal of Information Management 42: 13–24. Rathore, Ashish Kumar, Santanu Das and P. Vigneswara Ilavarasan. 2018. “Social media data inputs in product design: Case of a Smartphone.” Global Journal of Flexible Systems Management 19 (3): 255–72. Sheppard, Ben, Melissa Janoske and Brooke Liu. 2012. “Understanding risk communication theory: A guide for emergency managers and communicators.” National Consortium for the Study of Terrorism and Responses to Terrorism (START). University of Maryland, 1–27. Son, Jaebong, Hyung Koo Lee, Sung Jin and Jintae Lee. 2019. “Content features of tweets for effective communication during disasters: A media synchronicity theory perspective.” International Journal of Information Management 45: 56–68. Southwell, Brian G. 2013. Social Networks and Popular Understanding of Science and Health: Sharing Disparities. JHU Press. Starbird, Kate, Dharma Dailey, Ann Hayward Walker, Thomas M. Leschine, Robert Pavia and Ann Bostrom. 2015. 
“Social media, public participation, and the 2010 BP Deepwater Horizon oil spill.” Human and Ecological Risk Assessment: An International Journal 21 (3): 605–30. Stieglitz, Stefan and Linh Dang-Xuan. 2013. “Emotions and information diffusion in social media— Sentiment of microblogs and sharing behavior.” Journal of Management Information Systems 29 (4): 217–48. Stieglitz, Stefan, Milad Mirbabaie, Björn Ross and Christoph Neuberger. 2018. “Social media analytics– Challenges in topic discovery, data collection, and data preparation.” International Journal of Information Management 39: 156–68. Suh, Bongwon, Lichan Hong, Peter Pirolli and Ed H. Chi. 2010. “Want to be retweeted? Large scale analytics on factors impacting retweet in Twitter network.” Proceedings – SocialCom 2010: 2nd IEEE International Conference on Social Computing, PASSAT 2010: 2nd IEEE International Conference
266 Handbook of big data research methods on Privacy, Security, Risk and Trust, 177–84. Accessed at https://doi.org/10.1109/SocialCom.2010 .33. Sushil. 2017. “Theory building using SAP–LAP linkages: An application in the context of disaster management.” Annals of Operations Research, 1–26. Accessed at https://doi.org/10.1007/s10479-017 -2425-3. Sutton, Jeannette, Emma Spiro, Carter Butts, Sean Fitzhugh, Britta Johnson and Matt Greczek. 2013. “Tweeting the spill: Online informal communications, social networks, and conversational microstructures during the Deepwater Horizon oilspill.” International Journal of Information Systems for Crisis Response and Management (IJISCRAM) 5 (1): 58–76. Takahashi, Bruno, Edson C. Tandoc Jr and Christine Carmichael. 2015. “Communicating on Twitter during a disaster: An analysis of tweets during Typhoon Haiyan in the Philippines.” Computers in Human Behavior 50: 392–8. Teodorescu, Horia-Nicolai. 2015. “Using analytics and social media for monitoring and mitigation of social disasters.” Procedia Engineering 107: 325–34. Vieweg, Sarah, Amanda L. Hughes, Kate Starbird and Leysia Palen. 2010. “Microblogging during two natural hazards events: What Twitter may contribute to situational awareness.” In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 1079–88. ACM. Wallemacq, Pascaline and R. House. 2018. Economic Losses, Poverty & Disasters: 1998–2017. Centre for Research on the Epidemiology of Disasters, CRED. Wang, Nan, Blesson Varghese and Peter D. Donnelly. 2016. “A machine learning analysis of Twitter sentiment to the Sandy Hook shootings.” In 2016 IEEE 12th International Conference on E-Science (e-Science), 303–12. IEEE. Wu, Desheng and Yiwen Cui. 2018. “Disaster early warning and damage assessment analysis using social media data and geo-location information.” Decision Support Systems 111: 48–59. Yang, Jiang and Scott Counts. 2010. “Predicting the speed, scale, and range of information diffusion in Twitter.” In Fourth International AAAI Conference on Weblogs and Social Media. Yates, Dave and Scott Paquette. 2011. “Emergency knowledge management and social media technologies: A case study of the 2010 Haitian Earthquake.” International Journal of Information Management 31 (1): 6–13. Yoo, Eunae, William Rand, Mahyar Eftekhar and Elliot Rabinovich. 2016. “Evaluating information diffusion speed and its determinants in social media networks during humanitarian crises.” Journal of Operations Management 45: 123–33. Yoo, SoYeop, JeIn Song and OkRan Jeong. 2018. “Social media contents based sentiment analysis and prediction system.” Expert Systems With Applications 105: 102–11. Zarrella, D. 2009. “Science of retweets report.” Accessed 20 March 2011 at http://danzarrella.com/the -science-ofretweets-report.html.
16. Does personal data protection matter in data protection law? A transformational model to fit in the digital era
Gowri Harinath
1. INTRODUCTION

The pervasive feature of modern computing is the high-speed internet connecting everyone with other people and with service providers. The internet has a massive effect on the world, including on how people behave. In addition, a contemporary provocation is the user’s erroneous assumption that they control the retention and use of their own data (Kirley, 2015). The expansion of Information and Communication Technology (ICT) in data collection, processing and storage, resulting in the Internet of Things (IoT), big data and fog computing, and the escalation of automation in manufacturing, energy, healthcare, urban living and so on, make ICT management and security more demanding than ever before (Bertino, 2021). In addition, the unusual acceleration of the digital revolution during the COVID-19 pandemic saw spending to meet remote-working needs rise by 4.9 percent from 2020 to a total of $332.9 billion, underlining how far business now depends on the IT field. The number of cyber attacks has dramatically increased since the COVID-19 pandemic began. The pandemic has brought about major cyber risks, due not only to human actions but also to failures of systems and technology (“Gartner Forecasts Worldwide IT Spending to Grow 6.2% in 2021”, 2021). According to IBM Security, customers’ Personally Identifiable Information (PII) ranks highest, accounting for nearly 82 percent of total data breaches in 2020. Data breaches not only lead to hefty fines being levied on the organisation, which affects the economy’s growth in a larger perspective, but also raise questions about the safety, privacy and trustworthiness of the regulatory bodies in the respective countries (Bertino, 2021).

There is a growing risk to the personal data protection and privacy of individuals. The increasing use of technology by a wide range of entities to store personal data electronically is leading to a proportionate increase in the risk of identity theft (Groot, 2021). The utilization of new advancements has opened the door for criminals to use personal information illegally (identity theft), increasing the law enforcement challenges faced by regulatory authorities around the world. The usage of personal information is the core of identity theft, so a more detailed description of personal information protection follows the account of identity theft. The enormous loss to the economy caused by data breaches, due to the overcomplication of classifying personal data, leads to two key questions being raised in this chapter:

1. How can personal data be classified into various categories so that entities, regulatory bodies and researchers can determine the potential risks?
2. Could a self-controlled and co-evolved model possibly reduce failure due to dynamic circumstances?
2. HOW DOES PRIVACY MATTER IN DATA PROTECTION LAW?
The boom in machine learning algorithms in the last decade has made customization so pervasive that most people ignore the fact that their data is aggregated with that of others to build profiles of individuals which predict their interests according to others’ habits. These interpreted selves are not merely the product of our own actions and tastes; they are constructed by recognizing similar patterns among millions of other people. The way machines perceive us depends on the way our data connects to each other. The likes and preferences of people who do not exist in the systems can be readily predicted according to the models of others (Boyd, 2012).

The very notion of privacy conflicts with other societal values, legal regimes and individual independence. It is important to address privacy, a chameleon-like word that covers several aspects, ranging from confidentiality of personal information and autonomy in data processing to generating goodwill for one’s interests. In fact, privacy and data protection are two sides of the same coin; as Zelman Cowan stated, a man without privacy has no dignity (Chesterman, 2012). The contemporary concern with privacy goes back to the 1990s, when privacy activists became increasingly concerned about the security of online transaction data arising from web browsing, the use of credit cards and smart highways. In the present circumstances, however, privacy experts are calling for a comprehensive review of the concept of privacy protection in parallel with the ongoing technological revolution (Kesan et al., 2013).

Privacy is a fundamental right with regard to freedom, democracy, psychological well-being, individuality and creativity (Solove, 2008). The United Nations International Covenant on Civil and Political Rights states that: “No one shall be subjected to arbitrary or unlawful interference with his privacy, family, home or correspondence, nor to unlawful attacks on his honour and reputation. Everyone has the right to the protection of the law against such interference or attacks.” (International Covenant on Civil and Political Rights, 1976, art. 17). Privacy is known to everyone, yet remains elusive, which is causing privacy to die repeatedly. A key human concern is that people want to be left alone with their secrets yet, ironically, they do not appear to be particularly vigilant when it comes to protecting them (Kirley, 2015). As stated by Calvin Gotlieb, “most people, when other interests are at stake, do not care enough about privacy to value it” (Solove, 2008, Ch. 1, p. 5).

2.1    A Comparison of Privacy Concepts
Privacy means different things to different people, so to address crucial concerns of privacy, the way privacy is conceptualised in policymaking and the legal regime could serve larger purposes effectively (Chesterman, 2012). The 1890 Harvard Law Review article that first wrote about the concept of "the right to be let alone" remains one of the most influential articles articulating the "Right to Privacy" (Warren and Brandeis, 1890). It is an intuition-based theory, and later scholars have formulated highly pluralistic accounts of privacy. One such account is the "family resemblances" taxonomy relating to the cluster of privacy-insulated events. Daniel Solove's "taxonomy of privacy" theory, for example, focuses on specific interferences or disruptions to privacy based on Ludwig Wittgenstein's "family resemblances" notion: some things may not share a common feature, but they are nonetheless "connected with each other in many different ways". For example, family members usually share features with one another such as eye colour, size, facial structure, hair colour and so on, although they may not have a single common trait; they therefore form a complex network of overlapping and criss-crossing commonalities. This eliminates the need for a single traditional concept of privacy: instead of drawing on one shared element, Wittgenstein demonstrated the idea of a common pool of similar elements, and Solove suggested that this principle could be used to protect personal information over the long term, avoiding variability and deciphering the complexities of privacy theories. The taxonomy thus approaches the privacy problem from the bottom up instead of the top down, conceptualising privacy in four dimensions:

1. information collection: surveillance and interrogation;
2. information processing: aggregation, identification, insecurity, secondary use, and exclusion;
3. information dissemination: breach of confidentiality, disclosure, exposure, increased accessibility, and blackmail;
4. invasion: intrusion and decisional interference.

Analysing infringements of privacy pluralistically, depending on the social importance of the value of privacy rather than on a uniform common denominator, results in stable yet flexible ways to accommodate changing attitudes towards privacy (Solove, 2008, p. 5). But there are strong disagreements with such taxonomy concepts, criticised in Jonathan Franzen's pithy account of privacy as "the Cheshire cat of values: not much substance, but a very winning smile" (Chesterman, 2012, pp. 392–3). Chesterman suggests there is a requirement for the redevelopment of privacy as "the concrete, the factual and the experienced situations" from the bottom up, stressing the coherent and dynamic aspects of privacy. In addition, the taxonomy principle by Solove is also a bottom-up approach, aiding a lucid, comprehensive and concrete creation of law and regulation for privacy issues of personal information. The process of collection, use or dissemination of information underpins this interpretation of privacy. There are functional restrictions on privacy in democracies such as the USA and Europe and in Asian jurisdictions including Taiwan, Hong Kong S.A.R., South Korea, Malaysia and India. Furthermore, Europe's Data Protection Directive has had a significant influence on other regimes through its broader framing of the right to privacy as "respect for private and family life". The European Court of Human Rights, however, has had to supplement this broad definition with specific protections, and it has been argued that this leads to unsatisfactory development of the law.
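To make the taxonomy concrete, the following sketch shows how a researcher might tag data-handling events against the four dimensions listed above. It is only an illustrative mapping, assuming hypothetical event names; the dimension labels themselves come from the taxonomy as summarised in this chapter.

```python
# Illustrative sketch: tagging data-handling events against the four privacy
# dimensions summarised above. Event names and the mapping are hypothetical.
SOLOVE_TAXONOMY = {
    "information collection": {"surveillance", "interrogation"},
    "information processing": {"aggregation", "identification", "insecurity",
                               "secondary use", "exclusion"},
    "information dissemination": {"breach of confidentiality", "disclosure",
                                  "exposure", "increased accessibility", "blackmail"},
    "invasion": {"intrusion", "decisional interference"},
}

def classify_event(activity: str) -> str:
    """Return the taxonomy dimension a given privacy-affecting activity falls under."""
    for dimension, activities in SOLOVE_TAXONOMY.items():
        if activity in activities:
            return dimension
    return "unclassified"

if __name__ == "__main__":
    for event in ["aggregation", "disclosure", "intrusion", "profiling"]:
        print(f"{event:>12} -> {classify_event(event)}")
```

The point of the sketch is simply that a bottom-up classification can be extended by adding new activities to a category without redefining privacy as a whole.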
3. THE EFFECTIVENESS OF PERSONAL INFORMATION IN THE EUROPEAN GDPR
Personal data is defined broadly in the European General Data Protection Regulation (GDPR) as "any information relating to an identified or identifiable natural person" (Art. 4 GDPR – Definitions – General Data Protection Regulation (GDPR), 2021, para. 1). This definition could easily be adapted for wider coverage and flexibility. The definition takes in the content, the purpose and the result of the personal data accumulated (Yuvaraj, 2018). Furthermore, the GDPR has specific definitions for biometric, genetic and health information, online identities and IP addresses used to identify a person, as well as the right to be forgotten, making it the most advanced definition of personal data and information in the world. Cases such as Google v Spain (2014) strengthen the 'Right to be Forgotten', which concerns individuals' right to have personal information erased in the online environment, extending to the controller responsible for processing the data the obligation to erase any links, copies or replications of the personal data. Furthermore, Facebook v Ireland (the Safe Harbour decision, 2016), invalidating the legal mechanism for transferring EU consumers' personal data to Silicon Valley and elsewhere in the world, was based on the market location principle. Additionally, online identifiers such as IP addresses, cookie identifiers and radio frequency identification tags all fall under the personal data umbrella (Custers et al., 2019). Despite this, the UK's Information Commissioner's Office considers names, on their own, as non-personal data. Although one piece of data does not track down an individual, combined with other data it can reveal pertinent information. For instance, a data controller requesting information about people viewing products on their website may ask them to enter their date of birth or age to narrow down search options. A different organisation could ask for the person's occupation, which alone cannot be used to identify the individual yet, when synthesised with other data, can narrow down the pool of candidates. Even two pieces of information may not be considered personal data, but it is improbable that data would be stored without specific identifiers such as gender, locality or payroll number (McGavisk, 2021). It is understood that GDPR guidelines can protect privacy, but it is less likely that this protection extends to risks to individuals that are not potentially lucrative to big data users, such as medical research, anti-money-laundering policy, infrastructure planning and so on (Custers et al., 2019). A wider approach, more focused on the context of 'personal information', necessitates a holistic model where the data can be traced easily. In such a model, the manner in which personal data is collected, stored, accessed and corrected is traced, along with the risks linked to individuals. A model is needed which adapts to the changing environment yet is systematic, where units naturally join the system, so there is no need for constant regulation or scrutiny to protect personal data. An internal structure that voluntarily organises itself can also provide a picture of the origins of personal data. Such a system would be effective and efficient, needing no regulatory changes but only small adjustments by entities and government to the way personal data is classified (Adrian, 2012). Data integration and sharing are impeded by legitimate and pervasive privacy concerns. Organisations would be able to share information to stimulate productivity; however, they are prevented from doing so by the fear of being exploited by competitors or by securitisation concerns. Legislation, social norms, markets and technologies all need to work together as part of the overall solution. The feasibility of ethical self-regulation is reinforced by the user's capacity to rely on code to manage certain externalities and protect its environment from others' intrusions.
The code should not take the place of conscience but the code can support self-regulation when applied judiciously and prudently to protect privacy rights.
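The re-identification risk described above can be illustrated with a small, entirely fabricated dataset: individually, year of birth, occupation or locality look innocuous, but in combination they can narrow a pool of records down to a single person. The records and field names below are hypothetical and serve only to show the mechanics.

```python
# Minimal sketch of re-identification risk: combining quasi-identifiers
# (year of birth, occupation, locality) shrinks the candidate pool.
# All records are fabricated for illustration.
records = [
    {"name": "A", "birth_year": 1984, "occupation": "nurse",   "locality": "Wollongong"},
    {"name": "B", "birth_year": 1984, "occupation": "teacher", "locality": "Wollongong"},
    {"name": "C", "birth_year": 1991, "occupation": "nurse",   "locality": "Sydney"},
    {"name": "D", "birth_year": 1984, "occupation": "nurse",   "locality": "Sydney"},
]

def matches(quasi_identifiers: dict) -> list:
    """Return every record consistent with the released quasi-identifiers."""
    return [r for r in records
            if all(r[field] == value for field, value in quasi_identifiers.items())]

print(len(matches({"birth_year": 1984})))                          # 3 candidates
print(len(matches({"birth_year": 1984, "occupation": "nurse"})))   # 2 candidates
print(len(matches({"birth_year": 1984, "occupation": "nurse",
                   "locality": "Wollongong"})))                    # 1 -> re-identified
```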
4. CAN THE MODEL ADDRESS THE TWO KEY QUESTIONS? CAS MODEL ANALYSIS
4.1 Introduction

Data protection is a conglomerate and exhaustive field because the administration of personal data encompasses vast, diverse and boundless propositions. Whistle-blowers and public reports show that regulation and mandatory institutional practices relating to personal data often fail to comply with the governing law (Adrian, 2012). The Complex Adaptive System (CAS) is a heuristic that could pave the way for further legal research and adaptation, sufficient to investigate the perpetual issues concerning personal data protection faced by lawmakers. The model helps in understanding the accumulation of personal data from every action of individuals and how these actions are linked. Once the connections between various data items are established, it is easy to distinguish between personal and non-personal information of that individual, so applying privacy guidelines to potential scenarios and drawing conclusions becomes straightforward. The literature provides compelling examples from the past that demonstrate why CAS could be adopted, and a comprehensive analysis of the CAS model leads to reasonable conclusions that are consistent with this viewpoint (Zhang & Schmidt, 2015).

4.2 Background

The traditional perspective is inadequate for addressing turbulent circumstances; complex perspectives provide an elevated view of unforeseen situations and misleading information when they are encountered. Zhang and Schmidt's article demonstrates how a Complex Adaptive System (CAS) can be adapted to handle personal data. It consists of five sections: the introduction; the background; the possibilities of complexity theory; a sketch of the networked character of the community addressed by personal data protection laws, namely the Personal Data Community (PDC); and a detailed illustration of the nature of the CAS model. The central idea is the application of complexity theory and an analysis of the PDC to identify it as a CAS, followed by the suggestion that further research on the practical analysis of the CAS model would strengthen CAS theory and its adoption in personal data protection law. The article does not intend to address specific legislation or jurisdictions and thus views personal data in a global context, giving examples of personal data classification in the United States, Europe and China.

4.3 Meaning of CAS
A Complex Adaptive System (CAS) is a comprehensive network of varied components operating under basic rules, which allows a system to operate without any central control while producing complex aggregate behaviour. The system can perform sophisticated information processing and adjusts efficiently as it evolves.

4.4 Application of CAS to Personal Data Handling
The PDC possesses the same attributes as the CAS model. By combining the properties and examples of existing CAS models with the PDC, we could say that it consists of a complex web of interlinked agents forming "hubs". A comprehensive application of the key CAS properties to the PDC and its subject matter helps to break down personal data, which is often ignored by legal scholars and regulatory bodies. The CAS model addresses the two key issues of this chapter:

1. How is personal data classified in various categories so entities, regulatory bodies and researchers can determine the potential risks?
2. Could a self-controlled and co-evolved model possibly reduce failure due to dynamic circumstances in the future?

For example, the top five data breach sectors are health, finance, banking, education and legal (Urrico, 2019). These sectors are classified into various sub-categories in Figure 16.1 so organisations, regulatory bodies, decision-makers and individuals can track the origin of every personal data unit. After tracking, the risks involved in each unit can be evaluated based on the level of personal information shared in these sub-categories. This tracing method is efficient because the units in these categories show that every piece of personal information shared is different and must be treated differently for effectiveness. The model also resembles the "taxonomy of privacy" concept based on "family resemblances", as explained earlier in the discussion of privacy. The model keeps expanding with the changing environment and learns from the evolution of its units. Units are added not only from past events but also from future predictions or actual occurrences of events in the future. Units and new categories join the cluster without any central control over the model. This gives flexibility, as the model is open to new additions in the future and helps to trace personal data and its sensitivity in each category. In addition, there is no need for significant changes or amendments to the privacy regime, as the model self-organises around the sources of data breaches. It further segments the fields where data breaches occur and divides each segment into categories that have various sub-categories. All that is needed is the articulation of personal information into various categories, sub-categories and units. The model treats data collection, which is primary, as important, but it gives a whole new direction to data protection by explaining how the data sets in each segment could be secured, thus establishing multi-layered protection by the time the data enters the huge cluster that is collectively called personal data (Zhang and Schmidt, 2015).
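As a rough illustration of the classification idea (sector, then category, then data unit) sketched in Figure 16.1, the following snippet builds a small, hypothetical hierarchy for two of the breach-prone sectors named above and attaches an indicative risk level to each unit. The sectors come from the chapter; the category, unit and risk labels are invented purely for the example.

```python
# Hypothetical sketch of a CAS-style hierarchy: sector -> category -> data units,
# each unit carrying an indicative risk level. All labels are illustrative only.
personal_data_segments = {
    "health": {
        "clinical records": [("diagnosis", "high"), ("prescription history", "high")],
        "administration":   [("appointment times", "low"), ("billing address", "medium")],
    },
    "finance": {
        "accounts":        [("account number", "high"), ("transaction history", "high")],
        "identification":  [("date of birth", "medium"), ("payroll number", "medium")],
    },
}

def units_at_risk(min_level: str = "high") -> list:
    """List (sector, category, unit) tuples whose indicative risk is at least min_level."""
    order = {"low": 0, "medium": 1, "high": 2}
    found = []
    for sector, categories in personal_data_segments.items():
        for category, units in categories.items():
            for unit, level in units:
                if order[level] >= order[min_level]:
                    found.append((sector, category, unit))
    return found

# New sectors, categories or units can be appended without central coordination,
# mirroring the self-organising growth of the Personal Data Community.
print(units_at_risk("high"))
```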
4.5 Can CAS Theory be Applied to the Personal Data Community?
The Personal Data Community is complex and not as straightforward as it appears, because it must take in all data users, both individual and institutional, which in turn have further subunits. These subunits comprise individuals using data and organisational data users, namely banks, businesses (big and small), large multinational organisations, governments, non-profit organisations, and so on. The PDC is a web of webs; hence, it is highly significant to understand how PDCs are formed from units/sub-PDCs and other subunits/sub-PDCs. The PDC has the attributes described by CAS theory: it is systematic, dynamic and complex. Table 16.1 details the elements used to decide whether the PDC is a CAS or not.
Table 16.1 A matrix on PDC vs. CAS

Systematic: a CAS is a whole that has aggregates of agents that are networked, diverse, signalling and often metabolizing CAS (themselves).
Dynamic: a CAS is adaptive yet sensitive to (co-)evolution, learning and critical transitions.
Complex: a CAS shows emergent behaviour that is often without central control, path dependent and non-linear.
Figure 16.1 Interconnection of Personal Data segments
Systematic

According to Merriam-Webster's Collegiate Dictionary, a system is "a regularly interacting or interdependent group of items forming a unified whole: as a gravitational system, thermodynamic system, digestive system, river system, a computer system, capitalist system". Similarly, a CAS is a unified whole with a collective network of agents and signals operating under basic rules. For something to be considered a CAS, it is important that the whole holds an identity with boundaries and some internal coherence. Likewise, a weather forecast system has an atmospheric domain with an isobar, a line connecting points of equal pressure, as its boundary. As mentioned earlier, a PDC is a network of networks constituting large amounts of data held by data users linked across the world, whose boundaries are determined by any "further links to responsible individuals interested in personal data". The diagram shows that the number of data users within the PDC fluctuates.
Dynamic

A CAS can adapt and co-evolve with its environment because no unit is independent; all are interconnected. As the environment changes, the CAS also adjusts to the change, and vice versa, so the environment and the CAS run in parallel and reciprocate change. The PDC possesses the same dynamic, co-evolving attribute. The PDC's co-evolution can be explained with Lessig's model of cyberspace, which recognises four constraints: law, market, technology and culture. The PDC registers changes in these elements, for example changes in technology through variations in agent behaviour, or the strengthening of national security after the 9/11 incident, which led to changes in data protection and privacy, specifically protection of the 'right to be left alone'.

Complexity

The aggregate interconnections of networks in a PDC are not just collections of individual personal data but are like "a cat's cradle of interaction" between dynamic units. The complex system, however, is not controlled centrally and behaves in a non-linear way. A CAS has an internal structure and a dynamic history of evolving and learning by adapting to the environment. So does the PDC, which displays self-organisation, meaning that varied units and subunits join the system voluntarily without any internal or external influence from individuals. For example, the accumulation of personal data through individual and community actions and the communications of service providers, technology providers, businesses and individuals using social networking sites like Facebook forms a gigantic community through the self-organisation of data. Self-organisation is beneficial to the PDC because it allows evolution and development without central control over individuals. Legal scholars could research PDC behaviours without focusing on individual accountability for control systems, further taking into account the strengths and weaknesses of hubs in the networks, which would be helpful when applied by analogy to regulatory approaches to pandemics such as Covid-19. Figure 16.2 illustrates that the PDC is systematic, dynamic and co-evolves externally with different units by cooperating, interacting and developing in relation to personal data. The process of co-evolution, continuously adapting by learning and making conscious behavioural choices, is advantageous for the PDC in a system without central control. The CAS therefore gives a unique perspective on the PDC, which is otherwise considered unapproachable. This shows that the PDC is a systematic collection of data from various agents, with monitoring for the prediction of agents' behaviours; dynamic, with a changing environment leading to involuntary co-evolution; and complex, without central control and leaning towards non-linear model characteristics. The model could fill the gap between changing environments, which lead to the constant expansion of rules in legal regimes and their subject matter, better than a traditional theory. The CAS model could provide a detailed analysis of where the data originated, its connections, and the patterns of the types of data breach experienced by entities. It gives logical and discernible classifications and sub-categories of personal data for sound personal data protection, since it is mandatory for entities to be aware of the ICT environment and for businesses to possess knowledge about the types of information they create, retain and store. The issue here is what data is called personal and how it links to an individual.
The approach illustrated in Figure 16.2 not only lowers the probability of misuse, interference, loss and unauthorised access, but also reduces the need for constant governance by regulatory bodies.
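One minimal way to make the "web of webs" and "hubs" idea tangible is to model the PDC as a plain adjacency-list graph and count connections per node. The agents and links below are hypothetical; the sketch only shows how hubs could be identified in such a network, not how any regulator actually does so.

```python
# Minimal sketch: the Personal Data Community as an undirected graph.
# Agents and links are hypothetical; highly connected agents act as "hubs".
from collections import defaultdict

links = [
    ("individual_1", "social_network"), ("individual_2", "social_network"),
    ("individual_1", "bank"), ("individual_2", "bank"),
    ("social_network", "ad_broker"), ("bank", "regulator"),
    ("individual_3", "social_network"),
]

graph = defaultdict(set)
for a, b in links:
    graph[a].add(b)
    graph[b].add(a)

# Degree centrality as a crude indicator of which agents concentrate personal data.
hubs = sorted(graph, key=lambda node: len(graph[node]), reverse=True)
for node in hubs[:3]:
    print(node, "connections:", len(graph[node]))
```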
Figure 16.2 Illustration for Personal Data elements

5. LIMITATIONS OF THE CAS MODEL
The law is reactive and slow, especially in the face of rapidly changing technologies. It is often incomplete and vague, formulated quickly to "fix" a problem of public concern. Could this allow the PDC to get out of control and slip out of the hands of regulatory bodies, even though the CAS perspective offers the complex, dynamic and systematic approach that the management of personal data requires and could move data protection law into an advanced and intriguing dimension? Work on CAS is only beginning, and before it can be considered the face of any regulation it requires further extensive research at the intersection of CAS and data protection law. Hypothetical trial and error over large samples must be carried out before any model is accepted into the regime. The legislation would require continuous refinement until the CAS model fits completely into the regime, so while making changes it remains important to consider the reason a law came into force and its significance for the regime. A well-defined road map for the adoption of this model therefore requires a considerable amount of resources to improve the quality of future data protection law.
6. CONCLUSION

There have been several advancements by many jurisdictions to address issues concerning personal data protection around the world, but there is a constant increase in the cost of data breaches, a significant proportion of which is contributed by identity theft. The invasion of privacy through the misappropriation of personal information has called privacy concepts into question, but the 'taxonomy of privacy' based on the 'family resemblances' notion avoids confusion by assimilating these complexities. The concept of privacy is vast, so emphasising privacy while considering personal data protection could address a wide range of issues. Agencies buying 'commercialised' personal information from organisations increase the risks and raise doubts about the system's dependability. A series of incidents has driven continuous advancements in data protection law by several jurisdictions. The growing risks concerning personal data protection are largely due to the application of traditional concepts to the new issues faced in the digital age. The key issues addressed in this chapter are:

1. How is personal data classified in various categories so entities, regulatory bodies and researchers could determine the potential risks?
2. Could a self-controlled and co-evolved model possibly reduce failure due to dynamic circumstances in the future?

The detailed analysis in this chapter shows that there is still a need for a model which can voluntarily adapt to changing dynamics. The chapter recommends the CAS model, which co-evolves with the dynamic environment and is systematic, covering a set of categories, sub-categories and units joining the big pool of personal data. It is composed of the communication interconnections among the agents involved. The CAS model gives a new dimension to jurisdictions such as Australia, as it involves categories that can evaluate the potential risk of every unit of personal data without much governance or amendment of the regime. However, it requires resources, further research and sample tests before it can be implemented practically. This chapter contributes to the way personal data is handled and uplifts the traditional perspective. It conveys the inefficiency of current personal data protection regimes and argues that adoption of the CAS model could be suitable for the growing digital world.
REFERENCES

Adrian, A. (2012). 'Has a digital civil society evolved enough to protect privacy?'. Retrieved 29 November 2021, from https://journals.sagepub.com/doi/abs/10.1177/1037969X1203700309?journalCode=aljb.
Art. 4 GDPR – Definitions – General Data Protection Regulation (GDPR). (2021). Retrieved 2 December 2021, from https://gdpr-info.eu/art-4-gdpr/.
Bertino, E. (2021). 'Security Threats: Protecting the New Cyberfrontier', Purdue University. IEEE Xplore. Retrieved 19 November 2021, from https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=7490312.
Boyd, D. (2012). 'Networked Privacy'. Surveillance & Society, 10(3/4), 348–50. https://doi.org/10.24908/ss.v10i3/4.4529.
Chesterman, S. (2012). 'After Privacy: The rise of Facebook, the fall of WikiLeaks, and Singapore's Personal Data Protection Act 2012' (pp. 391–415). Singapore Journal of Legal Studies. Retrieved 9 November 2021, from https://law1.nus.edu.sg/sjls/articles/SJLS-Dec-12-391.pdf.
Custers, B., Sears, A., Dechense, F., Georgieva, I., Tani, T. & Hof, S. (2019). 'EU personal data protection in policy and practice'. Information Technology & Law Series, 29(1.1). Springer: Netherlands.
Gartner Says 4.9 Billion Connected "Things" Will Be in Use in 2015. (2021). Retrieved 29 November 2021, from https://www.gartner.com/en/newsroom/press-releases/2014-11-11-gartner-says-nearly-5-billion-connected-things-will-be-in-use-in-2015.
Gartner Forecasts Worldwide IT Spending to Grow 6.2% in 2021. (2021). Retrieved 29 November 2021, from https://www.gartner.com/en/newsroom/press-releases/2020-01-25-gartner-forecasts-worldwide-it-spending-to-grow-6-point-2-percent-in-2021.
Google Spain SL, Google v Agencia Espanola de Proteccion de Datos, Mario Costeja Gonzalez. (2014). C-131/12 ECLI:EU:C:2014:317 (Spain). Retrieved 29 November 2021, from http://curia.europa.eu/juris/document/document.jsf?text=&docid=152065&doclang=EN.
Groot, J. (2021). The History of Data Breaches. Retrieved 29 November 2021, from https://digitalguardian.com/blog/history-data-breaches.
International Covenant on Civil and Political Rights, opened for signature 16 December 1966, 2200A UNTS 221 (entered into force 3 January 1976), art 17.
Kesan, J., Hayes, C. & Bashir, M. (2013). Information Privacy and Data Control in Cloud Computing: Consumers, Privacy Preferences, and Market Efficiency, 1(1), 6.
Kirley, E. (2015). Reputational Privacy and the Internet: A Matter for Law? (pp. 5–7). Osgoode Hall Law School of York University.
McGavisk, T. (2021). The Positive and Negative Implications of GDPR in the Workplace. Retrieved 18 November 2021, from https://www.timedatasecurity.com/blogs/the-positive-and-negative-implications-of-gdpr-in-the-workplace.
Solove, D. (2008). 'Understanding Privacy'. Harvard University Press; Legal Studies Research Paper No. 420. Retrieved 12 November 2021, from https://scholarship.law.gwu.edu/cgi/viewcontent.cgi?article=2075&context=faculty_publications.
Urrico, R. (2019). 'Top 5 Data Breach Trends for 2020'. Credit Union Times: New York.
Warren, S. & Brandeis, L. (1890). 'The Right to Privacy'. Harvard Law Review, 4(5), 193–220. Retrieved 8 November 2021, from http://links.jstor.org/sici?sici=0017-811X%2818901215%294%3A5%3C193%3ATRTP%3E2.0.CO%3B2-C.
'What is personal information?' Office of the Australian Information Commissioner (OAIC). Retrieved 10 November 2021, from https://www.oaic.gov.au/privacy/guidance-and-advice/what-is-personal-information/.
Yuvaraj, J. (2018). 'How about me? The scope of personal information under the Australian Privacy Act 1988' (pp. 47–66). Computer Law & Security Review, 34.
Zhang, K. & Schmidt, A. H. J. (2015). Thinking of data protection law's subject matter as a complex adaptive system: A heuristic display. Computer Law & Security Review, 31(2), 201–20. https://doi.org/10.1016/j.clsr.2015.01.007.
17. Understanding the Future trends and innovations of AI-based CRM systems Khadija Alnofeli, Shahriar Akter and Venkata Yanamandram
1. INTRODUCTION

Customer relationship management (CRM) can be a very powerful Information Technology (IT) tool that helps achieve business success and maintain an ongoing relationship between the customer and the company. It is also a strategic enabler, encompassing corporate goals, and increasing customer satisfaction and retention, which leads to organisational success (Bohling et al., 2006; Kotorov, 2003; Long and Khalafinezhad, 2012; Payne and Frow, 2005). According to Leigh and Tanner (2004), building a successful organisation strategy can be accomplished through generating a customer-centric culture and focusing on customer loyalty. The dramatic growth of Artificial Intelligence (AI) systems has seen organisations renew their focus on AI-based CRM. For example, in a survey of 326 employees of Indian agile organisations, Chatterjee et al. (2021c) identified the importance of having a clear AI-based CRM process integration to create value and improve the competencies of agile organisations. The importance of using AI initiatives in data collection, data analytics and prediction models has resulted in many companies (for example, Amazon, Apple, LinkedIn, Netflix) employing a robust cloud servicing platform to accommodate requests from millions of customers (Akter et al., 2021). Through utilising existing data captured from each transaction across multiple touchpoints they deliver value and provide their customers with uniquely seamless experiences (Bradlow et al., 2017). This has led organisations to enhance the CRM technology approach and adopt intelligent AI systems to boost customer experience, improve customer lifetime value and drive business success through a data-driven decision-making process (Bradlow et al., 2017; Ngai, 2005; Saura et al., 2021). According to Saura et al. (2021), AI-based CRMs consist of sophisticated systems that capture, retrieve, store and analyse customer data, as well as automate processes to forecast customer behaviours and improve the digital ecosystem. Current literature has mainly emphasised the positive impact of AI-based CRM on organisational performance, profitability and success (Baabdullah et al., 2021; Chatterjee et al., 2021c; Guerola-Navarro et al., 2021; Hajipour and Esfahani, 2019; Saura et al., 2021). However, to date, there is a limited amount of published literature investigating the multidimensional AI-based CRM concept influencing employees' satisfaction level, customer churn rate and organisational productivity and competitiveness. There is a dearth of studies that focus on AI-based CRM dimensions, with few current studies focusing on organisational and management capabilities (Chatterjee et al., 2021c; Libai et al., 2020; Hallikainen et al., 2020); furthermore, they all focus on only one side of the dimension, such as CRM innovation and customer centricity (Guerola-Navarro et al., 2021), or organisational readiness and success (Chatterjee et al., 2021c).
Drawing on the highlighted research gaps, we identify two research questions: the first seeks to ascertain the association of AI-based CRM between the organisation's and customers' perspectives; the second aims to understand which dimensions can influence organisations' competitiveness. This research contributes to the literature on AI-based CRM by examining existing findings and identifying and proposing its dimensions and sub-dimensions.

RQ1: What is AI-based CRM?
RQ2: What are the dimensions of an AI-based CRM in the context of marketing?
2. RESEARCH METHODOLOGY
This study follows a semi-systematic literature review to provide a comprehensive understanding of the AI-based CRM phenomenon and its dimensions in the context of marketing, by identifying highly influential articles on "AI and CRM Dimensions" which could help provide a better understanding of its current applications. The review adopted two phases:

Phase one: a comprehensive search of recent and relevant journal articles was carried out using the selected keywords "AI" and "CRM".

Phase two: following the semi-systematic review procedures (Snyder, 2019), articles were collected from multiple databases, with a major focus on high-impact journals between 2015 and 2022, and a descriptive approach was employed to carefully analyse and categorise the dimensions and elements for the study.
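The screening step in phase two can be expressed as a simple keyword-and-year filter. The sample entries and field names below are invented purely to show the mechanics of such a filter; they are not the actual corpus used in this chapter.

```python
# Illustrative screening filter for the semi-systematic review (phase two).
# The article records are fabricated; only the filtering logic is of interest.
articles = [
    {"title": "AI-CRM in B2B marketing", "year": 2021, "keywords": {"AI", "CRM"}},
    {"title": "Loyalty programmes",      "year": 2014, "keywords": {"CRM"}},
    {"title": "Chatbots and service",    "year": 2019, "keywords": {"AI", "CRM"}},
]

# Keep articles published between 2015 and 2022 that match both keywords.
selected = [a for a in articles
            if 2015 <= a["year"] <= 2022 and {"AI", "CRM"} <= a["keywords"]]

for a in selected:
    print(a["year"], a["title"])
```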
3. FINDINGS OF LITERATURE REVIEW
This section will review relevant literature on AI-based CRM and discuss the theory that supports the research, further proposing a conceptual framework.

3.1 Defining CRM
Customer Relationship Management (CRM) is a complex combination of technology and business marketing. It has its roots in relationship marketing (Buttle, 2003; Payne and Frow, 2005), focusing on the continuous relationship building between customers and companies, to help boost customer retention and loyalty (see Table 17.1). The success of any CRM platform majorly depends on firms’ initiatives and communications, which help improve their relationship with customers through analytical CRM by analysing existing data and monitoring customer behaviours and satisfaction. CRM is a customer-centric business strategy that aims to obtain, maintain and retain profitable customers (Buttle, 2003). CRM systems were used previously as a tool to organise customer information, which can help boost employee productivity and is hailed as the key to achieving competitive strategy (Bohling et al., 2006). However, measuring CRM remains complex, which makes it difficult to evaluate (Richards and Jones, 2008). As seen from Table 17.1, the definition of CRM has evolved over time, where the literature defined CRM based on its major capabilities including value creation, customer orientation and customer equity.
Effective implementation of a CRM initiative depends on the firm's value proposition, marketing strategy, organisational strategy and inter-organisational cooperation (Bohling et al., 2006). CRM is involved in Salesforce Automation (SFA) and is about understanding and marketing to each customer individually; furthermore, it comprises three major functional component areas: Marketing, Sales, and Services and Support (Tamošiūnienė and Jasilionienė, 2007). It can also help attract and identify valuable clients and build a long-lasting, sustainable and profitable partnership (Guerola-Navarro et al., 2021).
Table 17.1 CRM definitions

Dual Value Creation: CRM is not about selling the product; it is focused on the process of creating value for the customer and, during that process, creating value for the firm, "staying in existence" (Boulding et al., 2005).
Customer Orientation: CRM is a core organisational process that focuses on developing, maintaining and improving a long-term association with customers (Jayachandran et al., 2005).
Customer Equity: CRM is the process of improving the components of customer equity by increasing the firm's value equity, brand equity and relationship equity (Richards and Jones, 2008).
Strategic System: CRM is an information technology system that enables core business strategy by linking the knowledge management process and customer intelligence with process management and customer interaction (Kim and Kim, 2009).
Customer Centric: CRM utilises extensive innovative strategies to understand customer demands, cultivate advantaged customers and improve satisfaction to maintain a long-term partnership and provide a lasting competitive advantage (Lin et al., 2010).
Value Creation: CRM is a strategic function of a business environment where the customer has been included in the value creation processes by increasing customer awareness and engagement at a more meaningful level instead of just managing customers (Lipiäinen, 2015).
Business Analytics: CRM is the ability to analyse, integrate and leverage customer feedback and information to help support a better decision-making process and create business value through business analytics (Nam et al., 2019).
CRM is presented as a technology-based solution that helps in managing customer relationships by customising and personalising services, and obtaining greater customer retention (Jayachandran et al., 2005). According to Troisi et al. (2020), CRM strategies can be enhanced through growth hacking techniques, alongside data mining techniques, which can help execute intelligent decisions and offer customers good quality service that fits their needs. This will assist in driving economic, social and environmental sustainability (Vesal et al., 2021).

3.2 Defining AI-based CRM
Traditional CRM was used to organise customer information and it was more product-centric and profit-focused, whilst the new AI-based CRM is customer-centric, where it can analyse a high volume of customer data in the most effective and cost-efficient way (Chatterjee et al., 2021c; Saura et al., 2021). Currently, AI-based CRM is used for administrative tasks which will enable managers to make improved decisions based on the prediction estimate provided by the intelligent systems (Gligor et al., 2021; Libai et al., 2020). AI can significantly impact the CRM process in multiple disciplines including the automating process (tasks and activities), communication (providing personalised experience), partnership (unlocking and sharing information with the whole ecosystem), price matching (dynamic pricing), which will allow employees to be creative and have time to complete other important tasks (Baabdullah et al., 2021; Herhausen et al., 2020; Saura et al., 2021). According to Chi (2021), AI-based CRM, knowledge management and customer orientation can positively influence innovation capabilities and help maintain long-term customer relationships. Table 17.2 showcases current and existing studies within the field of AI-CRM.

Table 17.2 Selected studies on AI-based CRM

Baabdullah et al. (2021). Definition: AI-based B2B CRM is the process of improving B2B business performance through using AI automation. Example of AI-CRM: AI-based CRM can enhance business performance through employee experience, B2B engagement and information processing. Market: empirically examined B2B SMEs (n = 392) in Saudi Arabia. Findings: the AI system enabler is significantly influenced by employees' acceptance, attitude and technology road mapping, but not by the individual's professional expertise; a conceptual model is proposed.

Chatterjee et al. (2021b). Definition: the process of integrating AI into CRM systems to analyse large customer data to help organisations with their decision making. Example of AI-CRM: an AI Integrated CRM System (AICS) will digitalise organisational strategy and provide ease of use, simplicity and self-efficacy for the employees of the organisation. Market: obtained (n = 326) usable responses from employees of Indian agile organisations that adopt AI integrated CRM systems. Findings: there is a high potential for AI-CRM to improve the competencies of agile organisations.

Libai et al. (2020). Definition: the ability of an intelligent system to interpret and analyse large customer data to learn, use and achieve specified tasks and goals through leveraging big customer data. Example of AI-CRM: enabling human-like interactions between AI-based CRM systems and customers, which allows personalised and faster communication at a low cost. Market: critically reviewing the implications of AI capabilities that will transform CRM. Findings: AI-based CRM will help understand customer needs and develop customer communication, retain existing customers and acquire new customers, including customer acquisition, customer retention and customer development.

Saura et al. (2021). Definition: AI-based CRM is a system that processes multiple tasks which will help enhance the organisational data-driven decision-making process. Example of AI-CRM: B2B companies utilise AI-based CRM in B2B digital marketing strategies to process large-scale data. Market: systematic literature review of 30 academic articles on AI-based CRM in traditional and digital B2B marketing. Findings: AI-based CRM is used in three main B2B digital marketing strategies: (1) analytical; (2) operational; (3) collaborative.

Suoniemi et al. (2021). Definition: CRM-based technology is the process of leveraging IT capabilities to help improve customer service, loyalty and efficiency. Example of AI-CRM: firm-level IT capabilities, operational-level capabilities, system quality and productivity impacts of CRM projects. Market: an empirical study of the resource-based theory of CRM system capability from a survey of (n = 148) IT managers and (n = 474) end-users. Findings: CRM system capability has an immediate effect on IT capability, productivity gains and the productivity gain discrepancy.
AI-based CRM, alongside its IT capability, can be considered as a multidimensional concept as it doesn’t only work as a system that gathers information, but it also helps acquire, maintain, retain and analyse customers’ information to provide satisfying service for customers, develop long-term customer relationships and increase profitability (Suoniemi et al., 2021).
4. THE IMPORTANCE OF AI-BASED CRM
Researchers contend that AI-integrated CRM could help improve many aspects of the business such as: accomplishing competitive intelligence through analytical CRM (Nelson et al., 2020), enabling dual value creation (Itani et al., 2020; Libai et al., 2020), improving multichannel interaction (Hallikainen et al., 2020; Saura et al., 2021), effectively optimising and utilising the search engine (Järvinen and Taiminen, 2016; Herhausen et al., 2020; Peco-Torres et al., 2021), enhancing customer experience and customer lifetime value (Hajipour and Esfahani, 2019; Huang and Rust, 2017; Kim and So, 2022; Saura et al., 2021), empowering customer engagement (Agnihotri, 2020; Kumar, 2020) and improving organisational performance (Baabdullah et al., 2021; Chatterjee et al., 2021b; Obaze et al., 2021) by utilising AI-driven insights from the intelligent CRM. AI-based CRM helps recognise customer patterns by using real-time data exchange across all channels of interactions and creating a personal-level bond with the user (Kumar et al., 2019), and provides a clear 360-degree view of the customers from identification, attraction, development and retention (Guerola-Navarro et al., 2021).
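As a small, hypothetical illustration of the kind of data-driven decision support described above, the sketch below trains a churn-risk classifier on fabricated interaction features and flags an at-risk customer for a retention offer. It assumes scikit-learn is available and is not drawn from any of the cited studies; feature names, coefficients and thresholds are invented for the example.

```python
# Hypothetical churn-risk sketch for an AI-based CRM; all data is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
# Fabricated features: days since last purchase, support tickets, monthly spend.
X = np.column_stack([
    rng.integers(1, 365, n),
    rng.poisson(2, n),
    rng.gamma(2.0, 50.0, n),
])
# Synthetic rule: long inactivity and many tickets raise churn probability.
logits = 0.01 * X[:, 0] + 0.4 * X[:, 1] - 0.01 * X[:, 2] - 2.0
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)

# Score a new (fabricated) customer and flag them for a retention offer if risky.
customer = np.array([[200, 4, 35.0]])
churn_probability = model.predict_proba(customer)[0, 1]
print(f"churn probability: {churn_probability:.2f}")
if churn_probability > 0.5:
    print("flag for retention campaign")
```

A production AI-based CRM would obviously rely on far richer behavioural data and more sophisticated models; the sketch only shows how a prediction feeds a concrete retention decision.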
AI-based CRM systems are designed to interact with customers like frontline agents (Gursoy et al., 2019). Table 17.3 highlights examples of current AI-based CRM processes. Firms with strong AI-based CRM tend to have better organisational performance and facilitate quality customer interaction (Obaze et al., 2021), solve customers' problems without any human involvement, lower employees' workload and increase operational efficiency and effectiveness (Gursoy et al., 2019). Nevertheless, AI-based CRM also seeks product improvements by developing an AI forecasting model to help improve the performance of the product or service (Toorajipour et al., 2021), and offers dynamic pricing flexibility based on consumers' requirements (e.g., Delta Airline ticket flexibility and bundle arrangements) (Bildea and Gorin, 2017).
Table 17.3 AI-based CRM processes

Generate Brand Awareness: the customer's ability to understand, recall and recognise the brand's narratives, which allows the organisation to collect data on consumer behaviour, understand their preferences and gain brand legitimacy. Example: the apps of Amazon, Nike, Starbucks or Coca-Cola use AI systems to track customer purchasing behaviour and predict their likelihood to purchase other similar items, using multiple channels to promote a product or service with which the customers are already familiar (Gustafson and Pomirleanu, 2021; Kumar et al., 2019).

Leads Generation: the process of identifying potential customers through cookies, search engines and their contact information. Example: Salesforce and HubSpot tools can capture leads through predictive analytics and customer insights (Järvinen and Taiminen, 2016).

Automated Lead Scoring: the probability of converting a potential quality lead into an opportunity and possibly a high-paying loyal customer. Example: HubSpot, Salesforce or Marketo software assists employees in identifying and qualifying leads by targeting potential customers through predictive lead scoring and profiling personalised customer data (Järvinen and Taiminen, 2016).

Cross-Buying and Up-Selling: the process of upgrading or purchasing (cross-selling) various additional products or services from the same brand or organisation, which enhances customer retention and customer lifetime value and produces revenue. Example: Ryanair implements a sophisticated AI platform that captures customers' purchasing history and flight activities to provide ancillary services (cross-buying) or recommend flight upgrades (up-selling) (Ahmad et al., 2022; Bildea and Gorin, 2017; Hossain et al., 2020; Kumar et al., 2008).

Service Delivery: the process of automating customer interactions through intelligent AI devices that provide high-quality, consistent and timely service. Example: apps such as GrubHub, Menulog or Uber Eats use advanced data capabilities and high processing speeds to optimise delivery time; furthermore, IBM Watson-enabled self-service technologies help customers choose the most suitable jacket (Chi et al., 2020; Gursoy et al., 2019).

Personalisation: also known as customisation, a data-driven approach that tailors customers' preferences with a specific product or service to accommodate their needs, where AI works on a personal level and creates a bond with the user to boost brand trust and provide a superior brand experience by positively influencing customer engagement; the purpose is to analyse a huge multi-variance dataset and design content to meet customers' expectations. Example: Hilton Hotels worldwide personalises customers' experience and addresses their concerns by employing the robotic concierge "Connie" and using the Hilton Honors guest app (Gursoy et al., 2019; Järvinen and Taiminen, 2016; Kumar et al., 2019).

Providing Recommendations: the use of existing customer data and intelligent algorithms to study customers' purchasing patterns and user behaviour to propose a product or a service which will enhance their experience whilst improving organisational performance. Example: AI algorithms generate suggestions by leveraging real-time data; for example, Pandora provides music recommendations, Amazon suggests products based on "customers who bought this item also bought this", and Netflix provides movie recommendations based on viewing history (Bildea and Gorin, 2017; Kumar et al., 2019).

Virtual Reality: virtual reality showrooms provide the customer with 360-degree augmented reality video using machine learning algorithms, which can enable the customer to learn about the products and customise their buying experience. Example: Kiawah Island Real Estate provides prospective customers with a personalised home-plan virtual reality tour, customised with colours and pieces of furniture, which resembles a real physical presence (Syam and Sharma, 2018).

Virtual Assistants (Chatbots): interactive AI dialogue software that communicates with customers and provides diagnoses based on case history and previous data to solve problems and enhance their experience. Example: KLM Airlines' "Spencer", Microsoft "Cortana" and Amazon "Alexa" use sophisticated robo-advisor apps to answer their customers' queries and recommend solutions (Gursoy et al., 2019; Kumar et al., 2019; Syam and Sharma, 2018).

Customer Referral Value: the process of converting prospective customers into actual paying customers, which can be made through targeted seeding strategies or reward referrals. Example: American Express pays $100 for each referred customer, and HelloFresh has a HelloFriends programme where customers obtain a $50 voucher as a referral reward; these companies send automated messages and emails to their customers to stimulate the referral process (Agnihotri, 2020; Meyners et al., 2017).

After-Sales Service: the process of supporting customers' needs throughout the customer life cycle by utilising after-sales service data to validate predictive performance and maintain a high level of quality service delivered after a certain product or service, for the purpose of customer retention, influencing customer satisfaction and boosting organisational competitiveness. Example: Toyota, BMW and other automobile companies provide periodic maintenance and automated messages/emails to their clients for service repair, taking advantage of data mining techniques to cluster and analyse customers with similar behaviour (Ko et al., 2017; Shokouhyar et al., 2020).

Service Recovery: the process of proposing new ideas after a service failure and actively carrying out a service recovery to meet customer expectations and change the customer's state of dissatisfaction. Example: in the case of slow internet, the chatbot or virtual assistant will apologise to the customer and try to solve the issue by automatically rebooting the system without human-agent interaction (Lv et al., 2022).
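The "customers who bought this item also bought" pattern noted in the recommendations row of Table 17.3 can be approximated with simple co-occurrence counts. The baskets below are fabricated, and the snippet is only a sketch of the general idea, not any vendor's actual recommendation algorithm.

```python
# Minimal co-occurrence recommender sketch ("customers who bought X also bought...").
# Purchase baskets are fabricated for illustration.
from collections import Counter
from itertools import permutations

baskets = [
    {"running shoes", "socks", "water bottle"},
    {"running shoes", "socks"},
    {"yoga mat", "water bottle"},
    {"running shoes", "water bottle"},
]

co_occurrence = Counter()
for basket in baskets:
    for a, b in permutations(basket, 2):
        co_occurrence[(a, b)] += 1

def recommend(item: str, k: int = 2) -> list:
    """Return up to k items most often bought together with `item`."""
    scores = {b: count for (a, b), count in co_occurrence.items() if a == item}
    return sorted(scores, key=scores.get, reverse=True)[:k]

print(recommend("running shoes"))  # e.g. ['socks', 'water bottle']
```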
5. AI-BASED CRM DIMENSIONS
In recent studies, scholars have focused their research mainly on determining the organisational and technological factors that have influenced CRM processes. According to Chen and Popovich (2003), CRM is an integrated, customer-driven technology that consists of three key dimensions: people, processes and technology. It is a technology-based strategic approach that enhances customer value through developing an appropriate relationship with key customer segments (Payne and Frow, 2005). AI-based CRM is a multi-dimensional construct. Table 17.4 gives details of previously published studies on CRM dimensions and sub-dimensions from the period 2003 to 2021. Whereas some studies have focused on general aspects of the CRM dimension within the context of innovation without the inclusion of AI (Hillebrand et al., 2011; Lin et al., 2010; Payne and Frow, 2005; Tamošiūnienė and Jasilionienė, 2007), this chapter proposes four main dimensions of CRM for ensuring value creation, firm profitability and business success. Scholars contend that AI-based CRM's main objective is to improve the management of customer relationships and facilitate better commercial results for firms, by effectively establishing excellent information management processes, quality customer-centric relationships and good channel distribution (Guerola-Navarro et al., 2021). Payne and Frow (2005) acknowledge that by using a hybrid channel model, companies can interact with their customers constantly through a growing number of channels, which include direct mail, the internet, mobile apps, the salesforce, and so on. CRM is currently used as a tool to assist in increasing sales passively, where it can track customer activity and spending patterns (Tamošiūnienė and Jasilionienė, 2007). Moreover, it is a system that strengthens the business relationship with customers, at the same time as it helps in reducing costs and increasing productivity and profitability (Meena and Sahu, 2021). Furthermore, CRM is about understanding the customer and personalising products to satisfy customer preferences by ensuring commitment, trust, loyalty and satisfaction in the customer–organisation relationship. According to Libai et al. (2020), AI-based CRM capabilities include leveraging big customer data, communicating, understanding and creating human-like behaviour. Chatterjee et al. (2021a) emphasised the importance for organisations of having accurate data analysis of CRM activities, where AI plays a major role in analysing big data in a cost-effective way, which helps achieve major business success. Organisations should consider B2B relationship management as the main strategy for a long-term customer relationship, which can impact business profitability (Chatterjee et al., 2021b).
Table 17.4  Dimensions and sub-dimensions of AI capabilities

Dimension: Organisational Capability

Sub-dimension: Business Centricity. Definition: the ability to successfully reinforce competitive strategies in the market to meet business objectives and build shareholder value by looking at the macro and micro environmental factors related to disruptive innovation. References: Payne and Frow, 2005; Suoniemi et al., 2021; Tamošiūnienė and Jasilionienė, 2007.

Sub-dimension: Customer Centricity. Definition: the ability to create value by migrating the focus from product- to customer-centric, through building capabilities which allow firms to increase customer satisfaction, revenue, and profit optimisation; furthermore, creating a clear segmentation plan, which will help identify customers’ needs and wants and tailor organisational interactions accordingly. Customer centricity includes multiple subcategories: (1) customer identification; (2) customer attraction; (3) customer retention; (4) customer development. References: Awasthi and Sangle, 2012; Guerola-Navarro et al., 2021; Hillebrand et al., 2011; Ngai, 2005; Payne and Frow, 2005; Tamošiūnienė and Jasilionienė, 2007.

Sub-dimension: Market Orientation. Definition: an organisation-wide implementation of the marketing concept which helps decision makers develop effective marketing strategies by generating, disseminating and safeguarding the information of their customers and competitors. Reference: Crick et al., 2022.

Sub-dimension: Organisational AI-Climate. Definition: the degree to which the processes, practices, and procedures within service operations are supported with AI initiatives. A sustainable AI climate is focused on employee wellbeing, customer satisfaction, and organisations’ profitability. Reference: Akter et al., 2021.

Dimension: Technology Infrastructure

Sub-dimension: Innovation Capabilities. Definition: the ability of CRM to improve organisations’ innovation capabilities, which will help enhance business characteristics and organisations’ performance in order to optimise business success by focusing on innovation within the context of business management, including: (1) product innovation; (2) process innovation; (3) administrative innovation; (4) marketing innovation; (5) service innovation. References: Dalla Pozza et al., 2018; Guerola-Navarro et al., 2021; Saura et al., 2021.

Dimension: Data Infrastructure

Sub-dimension: Data Repository/Data Integration. Definition: a powerful integrated enterprise data storage that provides corporate memory of a customer and can analyse relevant customer data. Data integration is key to AI-based CRM collaboration, which helps consolidate customer data from various networks to simplify the data integration process; it employs a joint data model for customer-related data. References: Geib et al., 2006; Payne and Frow, 2005; Payne and Frow, 2006.

Sub-dimension: Data Quality. Definition: the process of data warehousing where the main focus is on data accuracy, recency, comprehensiveness and consistency; maintaining good data quality enables firms to execute effective AI-based CRM strategies. Its main functions include reliability, accuracy, consistency, recency and comprehensiveness. References: Alshawi et al., 2011; Nam et al., 2019.

Sub-dimension: Data Privacy. Definition: the process of carefully storing, securing and protecting confidential consumer personal information from loss, dissemination, disclosure and compromise. Its main functions include privacy safeguards, data protection, cybersecurity, data surveillance and privacy regulations. References: De Jong et al., 2021; Martin and Murphy, 2016; Zhang and Watson, 2020.

Sub-dimension: Data Analysis. Definition: the process of acquiring, processing, analysing and visualising data, used to examine and measure business activities and discover useful information that helps drive smarter decisions. AI plays a crucial part in data analysis, analysing customer behaviour and CRM activities including knowledge management, which can be fully optimised through leveraging the latest technology innovations, developing prediction models, analysing customer patterns and creating value by providing personalised offerings. References: Hillebrand et al., 2011; Hong-kit Yim et al., 2004; Libai et al., 2020; Payne and Frow, 2005; Saura et al., 2021.

Dimension: Service Offerings

Sub-dimension: Consistency in offers – intentions and value offerings – Value Creation. Definition: an interactive process which identifies both the value the customer receives from the organisation and the value the organisation receives from the customer, where the customer is perceived as the value co-creator. References: Boulding et al., 2005; Itani et al., 2020; Libai et al., 2020; Nijssen et al., 2017; Payne and Frow, 2006.

Sub-dimension: Service Interactions/Channel Integration. Definition: the process of adding value to the entity’s strategy by creating a single unified view of each customer journey and translating customer interaction via more than one channel, to meet business goals through understanding customer experience and executing significant activities to satisfy user experience. References: Awasthi and Sangle, 2012; Bradlow et al., 2017; Hallikainen et al., 2020; Payne and Frow, 2005; Saura et al., 2021.

Sub-dimension: Service Innovation. Definition: a means of developing existing or creating new service practices and resources that directly or indirectly result in the creation of new value propositions for the organisation and its customers. References: Akter et al., 2021; Woo et al., 2021.
The table above demonstrates the dimensions that are used to study CRM and innovation capabilities as a tool to identify valuable customers and maintain their loyalty (Guerola-Navarro et al., 2021). Hillebrand et al. (2011) indicated that a customer intimacy strategy helps the organisation build a strong customer relationship and meet customers’ needs. Innovation is another key element of firms’ capabilities, allowing analytical CRM to evolve and acquire intelligence that helps the company understand user behaviour and responses and execute advanced, focused strategies to ease the decision-making process (Saura et al., 2021). In these circumstances, innovation and technology advancement play a huge part in firms’ profitability. According to Tamošiūnienė and Jasilionienė (2007), technology-centric CRM is an IT-driven concept designed to improve business processes by focusing on different customer touchpoints to understand customers’ perspectives, needs and wants. Their study was conducted in Lithuania with a major focus on three CRM components: (1) customer; (2) relationship; and (3) management. Innovation can help organisations solve difficult challenges (Nijssen et al., 2017), whereas information management is the process of collecting, disseminating and archiving customer insights to help organise customer knowledge and generate appropriate marketing responses by “replicating the mind of the customer”, which helps drive CRM activities (Llamas-Alonso et al., 2009; Payne and Frow, 2006). The data capability is about using sophisticated tools such as database marketing, data mining, data warehousing, and push technology incorporated within CRM systems to help build enduring customer relationships (Hong-kit Yim et al., 2004). Customer database capabilities will help improve the coordination between information technology and marketing, which will enhance AI-based CRM process interaction and improve prediction models to deliver desired customised services and automate customer interactions to strengthen customers’ trust (Tamošiūnienė and Jasilionienė, 2007). According to Awasthi and Sangle (2012), there are seven data mining concepts which help increase customer loyalty in CRM, namely: association, classification, clustering, forecasting, regression, sequence discovery and visualisation. Furthermore, AI-based CRM systems are used to manage companies’ information, where data-driven decision making plays a major part in companies’ strategy (Saura et al., 2021). Data helps in adding value and creates competitive advantage by using intelligent management systems that aim to optimise satisfaction and increase loyalty.
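To make the database-marketing and data-mining ideas above more concrete, the sketch below illustrates a simple recency–frequency–monetary (RFM) style scoring of customer records in Python. It is a minimal, hypothetical illustration only: the field names and values are invented for this example and it does not describe any particular AI-based CRM product discussed in this chapter.

```python
# Illustrative only: hypothetical customer records for RFM-style scoring.
import pandas as pd

customers = pd.DataFrame({
    "customer_id": ["C1", "C2", "C3", "C4", "C5", "C6"],
    "recency_days": [5, 40, 200, 3, 120, 60],
    "frequency": [12, 4, 1, 20, 2, 5],
    "monetary": [900.0, 250.0, 40.0, 1500.0, 80.0, 300.0],
})

def tercile_score(series, higher_is_better=True):
    # Rank first so qcut always finds three distinct bin edges.
    ranked = series.rank(method="first", ascending=higher_is_better)
    return pd.qcut(ranked, 3, labels=[1, 2, 3]).astype(int)

# Lower recency is better, so invert the ranking direction for that feature.
customers["R"] = tercile_score(customers["recency_days"], higher_is_better=False)
customers["F"] = tercile_score(customers["frequency"])
customers["M"] = tercile_score(customers["monetary"])
customers["rfm_segment"] = (
    customers["R"].astype(str) + customers["F"].astype(str) + customers["M"].astype(str)
)

print(customers[["customer_id", "R", "F", "M", "rfm_segment"]])
```

Segment codes of this kind are typically a starting point for the customer identification, attraction, retention and development activities listed in Table 17.4.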
Another dimension is the use of multichannel integration, which includes direct and indirect multichannel CRM processes such as eCRM, mCRM and sCRM, using interactive tools including the internet, mail, sales calls, social media, and so on, to collect intelligence and to manage and maintain customers efficiently and effectively (Awasthi and Sangle, 2012). This process helps the organisation understand customers’ behaviours and their preferred channels of interaction, which are often implemented in a multichannel environment without a unified view of the customer (Llamas-Alonso et al., 2009; Awasthi and Sangle, 2012; Saura et al., 2021). The multichannel environment is about selecting the best channel to communicate the message to the customer and ensuring the customer receives a seamless experience tailored to their particular needs during their interaction with multiple touchpoints (Bradlow et al., 2017; Hallikainen et al., 2020).
6. AI-BASED CRM AND COVID-19
The COVID-19 crisis has impacted many industries (Shin and Kang, 2020): some businesses have shut down, and others have shifted their employees to remote or home-based working, which has drastically changed communication patterns (Sharma et al., 2020). During the pandemic, firms recognised the importance of optimising their technological capabilities and understanding their innovative processes and knowledge-sharing strategies, as these can have a major impact on the strategic decision-making process (Chi, 2021). Conversely, some customers have shown an increase in their purchasing probability for low-priced offerings (Habel et al., 2020). Salesforce, a major cloud-based CRM software company, reported revenue growth of 24 per cent for the fiscal year ending in January 2021, during COVID-19, with its Customer 360 platform and AI-powered Einstein providing an increase in returns for companies like AT&T, Zoom and Carrefour (Farber, 2021). According to Sharma et al. (2020), the COVID-19 crisis caused digital interaction to double, as companies that are more flexible and resilient to change adopted advanced technologies that served customers’ needs through multichannel integration when in-person presence was eliminated. This indicates that the fast adoption of AI technologies can be even more critical during a global crisis, and that it can enhance employees’ performance (Sharma et al., 2020). Firms have also shifted to smart technologies by continuously engaging their customers through virtual tours of online festivals, parks, museums and gaming events, a technology-led solution that emerged to deliver a temporary getaway and offset the limitations of physical presence (Kim and So, 2022).
7. CONCLUSION
This study identified AI-based CRM dimensions, together with the factors that influence these dimensions, and highlighted relevant examples. From a practical point of view, the value of this study is in identifying the most significant aspects of AI-based CRM that influence organisational performance and customer satisfaction. This research has some limitations, as it focused on only one aspect of AI-based CRM technology; further studies empirically evaluating these processes are required. Future studies might be relevant in the areas of
understanding how multimodal AI will impact the CRM process, understanding the core relationship between AI-based CRM and Customer Referral Value (CRV), and also how the dark side of AI-based CRM will impact consumer privacy.
REFERENCES Agnihotri, R. 2020. ‘Social media, customer engagement, and sales organizations: A research agenda’, Industrial Marketing Management, vol. 90, pp. 291–9. Ahmad, B., Liu, D., Akhtar, N. and Siddiqi, U.I. 2022. ‘Does service-sales ambidexterity matter in business-to-business service recovery? A perspective through salesforce control system’, Industrial Marketing Management, vol. 102, pp. 351–63. Akter, S., Wamba, S.F., Mariani, M. and Hani, U. 2021. ‘How to build an AI climate-driven service analytics capability for innovation and performance in industrial markets?’, Industrial Marketing Management, vol. 97, pp. 258–73. Alshawi, S., Missi, F. and Irani, Z. 2011. ‘Organisational, technical and data quality factors in CRM adoption – SMEs perspective’, Industrial Marketing Management, vol. 40, no. 3, pp. 376–83. Awasthi, P. and Sangle, P.S. 2012. ‘Adoption of CRM technology in multichannel environment: A review (2006–2010)’, Business Process Management Journal, vol. 18, no. 3, pp. 445–71. Baabdullah, A.M., Alalwan, A.A., Slade, E.L., Raman, R. and Khatatneh, K.F. 2021. ‘SMEs and artificial intelligence (AI): Antecedents and consequences of AI-based B2B practices’, Industrial Marketing Management, vol. 98, pp. 255–70. Bildea, T.S. and Gorin, T. 2017. ‘Towards capturing ancillary revenue via unbundling and cross-selling’, Journal of Revenue and Pricing Management, vol. 17, no. 2, pp. 102–114. Bohling, T., Bowman, D., LaValle, S., Mittal, V., Narayandas, D., Ramani, G. and Varadarajan, R. 2006. ‘CRM implementation: Effectiveness issues and insights’, Journal of Service Research, vol. 9, no. 2, pp. 184–94. Boulding, W., Staelin, R., Ehret, M. and Johnston, W.J. 2005, ‘A customer relationship management roadmap: What is known, potential pitfalls, and where to go’, Journal of Marketing, vol. 69, no. 4, pp. 155–66. Bradlow, E.T., Gangwar, M., Kopalle, P. and Voleti, S. 2017. ‘The role of big data and predictive analytics in retailing’, Journal of Retailing, vol. 93, no. 1, pp. 79–95. Buttle, F. 2003. Customer Relationship Management, Jordan Hill: Taylor & Francis Group. Available from: ProQuest Ebook Central. Chatterjee, S., Ghosh, S.K., Chaudhuri, R. and Chaudhuri, S. 2021b. ‘Adoption of AI-integrated CRM system by Indian industry: From security and privacy perspective’, Information Management & Computer Security, vol. 29, no. 1, pp. 1–24. Chatterjee, S., Rana, N.P., Tamilmani, K. and Sharma, A. 2021c, ‘The effect of AI-based CRM on organization performance and competitive advantage: An empirical analysis in the B2B context’, Industrial Marketing Management, vol. 97, pp. 205–19. Chatterjee, S., Chaudhuri, R., Vrontis, D., Thrassou, A. and Ghosh, S.K. 2021a. ‘Adoption of artificial intelligence-integrated CRM systems in agile organizations in India’, Technological Forecasting & Social Change, vol. 168, p. 120783. Chen, I.J. and Popovich, K. 2003. ‘Understanding customer relationship management (CRM): People, process and technology’, Business Process Management Journal, vol. 9, no. 5, pp. 672–88. Chi, N.T.K. 2021. ‘Innovation capability: The impact of e-CRM and COVID-19 risk perception’, Technology in Society, vol. 67, p. 101725. Chi, O.H., Denton, G. and Gursoy, D. 2020. ‘Artificially intelligent device use in service delivery: A systematic review, synthesis, and research agenda’, Journal of Hospitality Marketing & Management, vol. 29, no. 7, pp. 757–86. Crick, J.M., Karami, M. and Crick, D. 2022, ‘Is it enough to be market-oriented? 
How coopetition and industry experience affect the relationship between a market orientation and customer satisfaction performance’, Industrial Marketing Management, vol. 100, pp. 62–75.
292 Handbook of big data research methods Dalla Pozza, I., Goetz, O. and Sahut, J.M. 2018. ‘Implementation effects in the relationship between CRM and its performance’, Journal of Business Research, vol. 89, pp. 391–403. De Jong, A., De Ruyter, K., Keeling, D.I., Polyakova, A. and Ringberg, T. 2021. ‘Key trends in business-to-business services marketing strategies: Developing a practice-based research agenda’, Industrial Marketing Management, vol. 93, pp. 1–9. Farber, D. 2021. ‘Forever Changed’: Salesforce Looks Back at Year Defined by COVID-19, Salesforce news & insights. Accessed 30 March 2021 at: https://www.salesforce.com/news/stories/fy21-year-in -review/. Geib, M., Kolbe, L.M. and Brenner, W. 2006. ‘CRM collaboration in financial services networks: A multi-case analysis’, Journal of Enterprise Information Management, vol. 19, no. 6, pp. 591–607. Gligor, D.M., Pillai, K.G. and Golgeci, I. 2021, ‘Theorizing the dark side of business-to-business relationships in the era of AI, big data, and blockchain’, Journal of Business Research, vol. 133, pp. 79–88. Guerola-Navarro, V., Gil-Gomez, H., Oltra-Badenes, R. and Sendra-García, J. 2021, ‘Customer relationship management and its impact on innovation: A literature review’, Journal of Business Research, vol. 129, pp. 83–7. Gursoy, D., Chi, O.H., Lu, L. and Nunkoo, R. 2019, ‘Consumers acceptance of artificially intelligent (AI) device use in service delivery’, International Journal of Information Management, vol. 49, pp. 157–69. Gustafson, B.M. and Pomirleanu, N. 2021, ‘A discursive framework of B2B brand legitimacy’, Industrial Marketing Management, vol. 93, pp. 22–31. Habel, J., Jarotschkin, V., Schmitz, B., Eggert, A. and Plötner, O. 2020, ‘Industrial buying during the coronavirus pandemic: A cross-cultural study’, Industrial Marketing Management, vol. 88, pp. 195–205. Hajipour, B. and Esfahani, M. 2019, ‘Delta model application for developing customer lifetime value’, Marketing Intelligence & Planning, vol. 37, no. 3, pp. 298–309. Hallikainen, H., Savimäki, E. and Laukkanen, T. 2020, ‘Fostering B2B sales with customer big data analytics’, Industrial Marketing Management, vol. 86, pp. 90–98. Herhausen, D., Miočević, D., Morgan, R.E. and Kleijnen, M.H. 2020, ‘The digital marketing capabilities gap’, Industrial Marketing Management, vol. 90, pp. 276–90. Hillebrand, B., Nijholt, J. and Nijssen, E. 2011, ‘Exploring CRM effectiveness: An institutional theory perspective’, Journal of the Academy of Marketing Science, vol. 39, no. 4, pp. 592–608. Hong-kit Yim, F., Anderson, R.E. and Swaminathan, S. 2004, ‘Customer Relationship Management: Its dimensions and effect on customer outcomes’, The Journal of Personal Selling & Sales Management, vol. 24, no. 4, pp. 263–78. Hossain, T.M.T., Akter, S., Kattiyapornpong, U. and Dwivedi, Y. 2020, ‘Reconceptualizing integration quality dynamics for omnichannel marketing’, Industrial Marketing Management, vol. 87, pp. 225–41. Huang, M.-H. and Rust, R.T. 2017, ‘Technology-driven service strategy’, Journal of the Academy of Marketing Science, vol. 45, no. 6, pp. 906–24. Itani, O.S., Krush, M.T., Agnihotri, R. and Trainor, K.J. 2020, ‘Social media and customer relationship management technologies: Influencing buyer–seller information exchanges’, Industrial Marketing Management, vol. 90, pp. 264–75. Jayachandran, S., Sharma, S., Kaufman, P. and Raman, P. 2005, ‘The role of relational information processes and technology use in customer relationship management’, Journal of Marketing, vol. 69, no. 4, pp. 177–92. 
Järvinen, J. and Taiminen, H. 2016, ‘Harnessing marketing automation for B2B content marketing’, Industrial Marketing Management, vol. 54, pp. 164–75. Kim, H.-S. and Kim, Y.-G. 2009, ‘A CRM performance measurement framework: Its development process and application’, Industrial marketing management, vol. 38, no. 4, pp. 477–89. Kim, H. and So, K.K.F. 2022, ‘Two decades of customer experience research in hospitality and tourism: A bibliometric analysis and thematic content analysis’, International Journal of Hospitality Management, vol. 100, p. 103082.
Understanding the Future trends and innovations of AI-based CRM systems 293 Ko, T., Lee, J.H., Cho, H., Cho, S., Lee, W. and Lee, M. 2017, ‘Machine learning-based anomaly detection via integration of manufacturing, inspection and after-sales service data’, Industrial Management + Data Systems, vol. 117, no. 5, pp. 927–45. Kotorov, R. 2003, ‘Customer relationship management: Strategic lessons and future directions’, Business Process Management Journal, vol. 9, no. 5, pp. 566–71. Kumar, M. 2020, ‘Effective usage of E-CRM and social media tools by Akshay Kumar: Most prolific Bollywood actor of last decade’, International Journal of Management, vol. 11, no. 2. Kumar, V., George, M. and Pancras, J. 2008, ‘Cross-buying in retailing: Drivers and consequences’, Journal of Retailing, vol. 84, no. 1, pp. 15–27. Kumar, V., Rajan, B., Venkatesan, R. and Lecinski, J. 2019, ‘Understanding the role of artificial intelligence in personalized engagement marketing’, California Management Review, vol. 61, no. 4, pp. 135–55. Leigh, T.W. and Tanner, J.F. 2004, ‘Introduction: JPSSM Special Issue on Customer Relationship Management’, The Journal of Personal Selling & Sales Management, vol. 24, no. 4, pp. 259–62. Libai, B., Bart, Y., Gensler, S., Hofacker, C.F., Kaplan, A., Kötterheinrich, K. and Kroll, E.B. 2020, ‘Brave New World? On AI and the management of customer relationships’, Journal of Interactive Marketing, vol. 51, no. 1. pp. 44–56. Lin, R.-J., Chen, R.-H. and Kuan-Shun Chiu, K. 2010, ‘Customer relationship management and innovation capability: An empirical study’, Industrial Management & Data Systems, vol. 110, no. 1, pp. 111–33. Lipiäinen, H.S.M. 2015, ‘CRM in the digital age: Implementation of CRM in three contemporary B2B firms’, Journal of Systems and Information Technology, vol. 17, no. 1, pp. 2–19. Llamas-Alonso, M.R., Jiménez-Zarco, A.I., Martínez-Ruiz, M.P. and Dawson, J. 2009, ‘Designing a predictive performance measurement and control system to maximize customer relationship management success’, Journal of Marketing Channels, vol. 16, no. 1, pp. 1–41. Long, C.S. and Khalafinezhad, R. 2012, ‘Customer satisfaction and loyalty: A literature review in the perspective of customer relationship management’, Journal of Applied Business and Finance Researches, vol. 1, no. 1, pp. 6–13. Lv, X., Yang, Y., Qin, D., Cao, X. and Xu, H. 2022, ‘Artificial intelligence service recovery: The role of empathic response in hospitality customers’ continuous usage intention’, Computers in Human Behavior, vol. 126, p. 106993. Martin, K.D. and Murphy, P.E. 2016, ‘The role of data privacy in marketing’, Journal of the Academy of Marketing Science, vol. 45, no. 2, pp. 135–55. Meena, P. and Sahu, P. 2021, ‘Customer Relationship Management research from 2000 to 2020: An academic literature review and classification’, Vision (New Delhi, India), vol. 25, no. 2, pp. 136–58. Meyners, J., Barrot, C., Becker, J.U. and Bodapati, A.V. 2017, ‘Reward-scrounging in customer referral programs’, International Journal of Research in Marketing, vol. 34, no. 2, pp. 382–98. Nam, D., Lee, J. and Lee, H. 2019, ‘Business analytics use in CRM: A nomological net from IT competence to CRM performance’, International Journal of Information Management, vol. 45, pp. 233–45. Nelson, C.A., Walsh, M.F. and Cui, A.P. 2020, ‘The role of analytical CRM on salesperson use of competitive intelligence’, The Journal of Business & Industrial Marketing, vol. 35, no. 12, pp. 2127–37. Ngai, E.W. 
2005, ‘Customer relationship management research (1992–2002): An academic literature review and classification’, Marketing Intelligence & Planning, vol. 23, no. 6, pp. 582–605. Nijssen, E.J., Guenzi, P. and van der Borgh, M. 2017, ‘Beyond the retention–acquisition trade-off: Capabilities of ambidextrous sales organizations’, Industrial Marketing Management, vol. 64, pp. 1–13. Obaze, Y., Xie, H., Prybutok, V.R., Randall, W. and Peak, D.A. 2021, ‘Contextualization of relational connectedness construct in relationship marketing’, Journal of Nonprofit & Public Sector Marketing, pp. 1–32. Payne, A. and Frow, P. 2005, ‘A strategic framework for customer relationship management’, Journal of Marketing, vol. 69, no. 4, pp. 167–76. Payne, A. and Frow, P. 2006, ‘Customer Relationship Management: From strategy to implementation’, Journal of Marketing Management, vol. 22, no. 1–2, pp. 135–68.
294 Handbook of big data research methods Peco-Torres, F., Polo-Peña, A.I. and Frías-Jamilena, D.M. 2021, ‘Revenue management and CRM via online media: The effect of their simultaneous implementation on hospitality firm performance’, Journal of Hospitality and Tourism Management, vol. 47, pp. 46–57. Richards, K.A. and Jones, E. 2008, ‘Customer relationship management: Finding value drivers’, Industrial Marketing Management, vol. 37, no. 2, pp. 120–30. Saura, J.R., Ribeiro-Soriano, D. and Palacios-Marqués, D. 2021, ‘Setting B2B digital marketing in artificial intelligence-based CRMs: A review and directions for future research’, Industrial Marketing Management, vol. 98, pp. 161–78. Sharma, A., Rangarajan, D. and Paesbrugghe, B. 2020, ‘Increasing resilience by creating an adaptive salesforce’, Industrial Marketing Management, vol. 88, pp. 238–46. Shin, H. and Kang, J. 2020, ‘Reducing perceived health risk to attract hotel customers in the COVID-19 pandemic era: Focused on technology innovation for social distancing and cleanliness’, International Journal of Hospitality Management, vol. 91, pp. 102664–102664. Shokouhyar, S., Shokoohyar, S. and Safari, S. 2020, ‘Research on the influence of after-sales service quality factors on customer satisfaction’, Journal of Retailing and Consumer Services, vol. 56, p. 102139. Snyder, H. 2019, ‘Literature review as a research methodology: An overview and guidelines’, Journal of Business Research, vol. 104, pp. 333–9. Suoniemi, S., Terho, H., Zablah, A., Olkkonen, R. and Straub, D.W. 2021, ‘The impact of firm-level and project-level IT capabilities on CRM system quality and organizational productivity’, Journal of Business Research, vol. 127, pp. 108–22. Syam, N. and Sharma, A. 2018, ‘Waiting for a sales renaissance in the fourth industrial revolution: Machine learning and artificial intelligence in sales research and practice’, Industrial Marketing Management, vol. 69, pp. 135–46. Tamošiūnienė, R. and Jasilionienė, R. 2007, ‘Customer relationship management as business strategy appliance: Theoretical and practical dimensions’, Journal of Business Economics and Management, vol. 8, no. 1, pp. 69–78. Toorajipour, R., Sohrabpour, V., Nazarpour, A., Oghazi, P. and Fischl, M. 2021, ‘Artificial intelligence in supply chain management: A systematic literature review’, Journal of Business Research, vol. 122, pp. 502–17. Troisi, O., Maione, G., Grimaldi, M. and Loia, F. 2020, ‘Growth hacking: Insights on data-driven decision-making from three firms’, Industrial Marketing Management, vol. 90, pp. 538–57. Vesal, M., Siahtiri, V. and O’Cass, A. 2021, ‘Strengthening B2B brands by signalling environmental sustainability and managing customer relationships’, Industrial Marketing Management, vol. 92, pp. 321–31. Woo, H., Kim, S.J. and Wang, H. 2021, ‘Understanding the role of service innovation behavior on business customer performance and loyalty’, Industrial Marketing Management, vol. 93, pp. 41–51. Zhang, J.Z. and Watson IV, G.F. 2020, ‘Marketing ecosystem: An outside-in view for sustainable advantage’, Industrial Marketing Management, vol. 88, pp. 287–304.
18. Descriptive analytics methods in big data: a systematic literature review Nilupulee Liyanagamage and Mario Fernando
1. INTRODUCTION TO BIG DATA
The explosion of available information in the era of Big Data has received much attention from academics and practitioners across disciplines (Fan et al., 2014; Wamba et al., 2015; Boyd and Crawford, 2012). The Big Data movement has brought new opportunities for modern society (Fan et al., 2014), particularly with its ability to transform business processes, facilitate innovation and “revolutionize the art of management” (Wamba et al., 2015: 234). This trend offers technologies that promise to provide the right information to the right person at the right time, in the right volume and quality (Buchmüller et al., 2014). The extant literature defines Big Data with 3Vs: volume, velocity and variety (Kitchin and McArdle, 2016). Volume is the quantity of data; velocity is the speed of data generation or data delivery; and variety denotes the various configurations, sources and formats of the data – structured, unstructured or even semi-structured (Kitchin and McArdle, 2016; Zhou et al., 2014). Some scholars have considered two other Vs to strive for a more holistic approach to understanding Big Data (Wamba et al., 2015): value, the economic or social benefits of data (Günther et al., 2017), and veracity, the importance of quality data and trust in data sources (White, 2012). Despite the interest and trends in Big Data, there is much to know about the concept, its real potential (Wamba et al., 2015; Labrinidis and Jagadish, 2012), and the unique computational and statistical challenges that it presents (Fan et al., 2014; Boyd and Crawford, 2012). The common understanding is that “size is the only thing that matters” (Labrinidis and Jagadish, 2012: 2032). This has led to the assumption that the more data we process, the more likely we will improve decision-making in business, science and politics, and even in our private lives (Buchmüller et al., 2014). This has tempted researchers to adopt Big Data as a replacement for traditional methods of data analysis without considering its challenges. Big Data is different from traditional and small data sets in terms of methods, sampling, data quality, repurposing and management (Kitchin and McArdle, 2016). Data analysis is challenging for Big Data for various reasons: the lack of coordination between database systems (Labrinidis and Jagadish, 2012); difficulty interpreting results (Labrinidis and Jagadish, 2012; Buchmüller et al., 2014); sources of error such as computer system bugs or erroneous data (Kitchin and McArdle, 2016); computational costs; and statistical bias (Fan et al., 2014). To deliver the promise of data-driven decision-making to unleash new organisational capabilities and value (Kitchin and McArdle, 2016; Labrinidis and Jagadish, 2012; Wamba et al., 2015), it is important that scholars not only discuss or explore the what and the why of using Big Data, but, more critically, evaluate how Big Data can be employed. To interpret and understand the how aspect, it is important to examine how scholars research Big Data. Although there are three key types of Big Data analytics – descriptive, predictive and prescriptive analytics – in this chapter, we focus on the descriptive analytics method. While predictive and prescriptive analytics are gathering interest (Larson and Chang, 2016), the majority of organisations still mainly rely on descriptive analytics in Big Data analysis (Duan and Xiong, 2015; Zhou et al., 2014; Lepenioti et al., 2020; Tabesh et al., 2019). This importance is reflected in several review papers relating to descriptive analytics (see Duan and Xiong, 2015; Sun et al., 2013; Tsai et al., 2015). However, these reviews do not particularly focus on descriptive analytics; rather, they discuss Big Data analytics broadly while touching on descriptive analytics. Our aims in this chapter are twofold. First, we draw on prior scholarly publications to explore how Big Data is analysed. Second, we assess the role descriptive analytic methods play in Big Data analysis. This chapter is organised as follows. First, we examine the potential data analytical methods in Big Data. Second, we introduce the research methodology for this systematic literature review. Third, drawing on prior research, we provide definitions of descriptive analytics methods and review the applications, benefits and challenges of descriptive analytics in Big Data analysis. Finally, we discuss the implications and limitations and outline directions for future research using descriptive analytics.
2. BIG DATA ANALYSIS (BDA)
Big Data has transformed data-driven research in business, science, engineering, education and various other fields of study (Nguyen et al., 2018; Swaminathan, 2018; Duan and Xiong, 2015). Although many business organisations have attempted to take advantage of Big Data, studies show that only a small percentage actually benefit from their investments while the majority fail to successfully implement Big Data driven decision-making (Ross et al., 2013; Tabesh et al., 2019). These failures signify the importance of data analytics. Without Big Data Analytics (BDA), Big Data would not be able to create as much value (Babiceanu and Seker, 2016). BDA is the technique used to analyse, extract meaningful patterns, predictions, and visualise knowledge and intelligence from Big Data for decision-making (Najafabadi et al., 2015; Duan and Xiong, 2015; Sun et al., 2018). BDA facilitates statistical reliability in research, as Big Data provides a large population size from a high volume of data (Duan and Xiong, 2015). Extant literature identified three main types of data analytic methods: descriptive analytics, predictive analytics and prescriptive analytics (Duan and Xiong, 2015; Tabesh et al., 2019). Descriptive analytics provides a summary of descriptive statistics for a given data sample (Duan and Xiong, 2015). For instance, it could include mean, mode, median, range, histogram, and standard deviation. These results could be displayed through graphs, tables and charts. Descriptive analytics supports us to answer questions such as, “What has happened?”, “Why did it happen?” “What is happening now?” (Lepenioti et al., 2020: 57). By answering these questions, organisations can uncover reasons why a certain event occurred in the past and also identify relationships among data (Soltanpoor and Sellis, 2016). Predictive analytics use artificial intelligence (AI), optimisation algorithms and expert systems to predict future behaviours based on patterns uncovered in the past and the assumption that history will repeat (Duan and Xiong, 2015; Lepenioti et al., 2020; Gunasekaran et al., 2017). Based on this information, predictive analytics can answer “What will happen” “Why will it happen?” (Lepenioti et al., 2020: 57). The most common predictive analytic methods
are decision trees, pattern recognition, Bayesian statistics, regression, neural networks and Markov models (Tabesh et al., 2019; Duan and Xiong, 2015). Lastly, prescriptive analytics employs mathematical programming and simulation modelling to identify the optimum action (Lepenioti et al., 2020). Prescriptive analytics answers the questions “What should I do?” and “Why should I do it?” (Lepenioti et al., 2020: 57). Table 18.1 provides a summary of the three Big Data analytic methods and their common algorithms.
Table 18.1  An overview of the Big Data analytics algorithms

Descriptive analytics:
● Clustering is the process of allocating items into groups; segmenting. There are five types of clustering: partitioning-based, hierarchical-based, density-based, grid-based and model-based (Fahad et al., 2014).
● Associations is the search for sets of items frequently appearing together in a data set; bundling data (Tabesh et al., 2019).
● Generative models and sequential pattern discovery look for patterns in the data and work as recommendation systems (Duan and Xiong, 2015; Tabesh et al., 2019).

Predictive analytics:
● Probabilistic models present uncertainty in causal relationships. This can be used to calculate the probability of a certain event occurring (Martinez et al., 2009).
● Machine learning relies on algorithms that can process without explicit instructions. Machine learning is interrelated with data mining; using these two techniques, businesses can extract data and uncover information to predict future outcomes (Lepenioti et al., 2020).
● Statistical analysis deals with data collection, analysis, presentation and interpretation (Lepenioti et al., 2020).

Prescriptive analytics:
● Mathematical programming can be used to program and plan the most optimal allocation of scarce resources (Lepenioti et al., 2020).
● Simulation involves modelling real-world situations on a computer to see how the system works. This can be used to improve the effectiveness of human decision-making (Greasley and Edwards, 2021).
● Logic-based models hypothesise various causes and effects to result in outcomes of interest (Lepenioti et al., 2020).
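As a minimal illustration of the descriptive analytics category in Table 18.1, the following Python sketch computes the kind of summary statistics (mean, median, standard deviation) that answer the question “What has happened?”. The data set and column names are hypothetical and purely illustrative.

```python
# Illustrative only: hypothetical daily sales records for a descriptive summary.
import pandas as pd

sales = pd.DataFrame({
    "region": ["North", "North", "South", "South", "East", "East", "West", "West"],
    "units": [120, 95, 60, 80, 150, 140, 30, 45],
    "revenue": [2400.0, 1900.0, 1200.0, 1650.0, 3100.0, 2800.0, 600.0, 950.0],
})

# Central tendency and dispersion for the whole sample ("what has happened?").
print(sales[["units", "revenue"]].describe())

# The same summary broken down by segment, a typical descriptive-analytics view.
print(sales.groupby("region")[["units", "revenue"]].agg(["mean", "median", "std"]))
```

In practice, the same summaries would be computed over far larger volumes of data and fed into dashboards or visualisation tools.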
3. SYSTEMATIC LITERATURE REVIEW PROCESS
In this section, we outline the literature review process based on Tranfield et al. (2003). This systematic review methodology has been widely employed for literature reviews in data analytics (Lepenioti et al., 2020). In the first stage of the review process, we searched the literature using the query term “Big Data descriptive analytics”. The search was conducted in the following databases: ABI/Inform Complete, Academic Search Complete, Business Source Complete, Elsevier (SCOPUS), Emerald, IEEEXplore, Taylor & Francis, and Association of Information Systems (AIS). Our search was limited to journal articles. We excluded books, conference publications, and grey literature, to ensure quality and validity of information and the review. In the initial phase, we queried the databases to find journal articles that contain the query “Big Data” in the abstract or title and “descriptive analytics” in the full text of the publication (Table 18.2). This search was conducted in October 2021. The first phase of the search resulted in a total of 2726 papers. However, we identified that not all search results contribute to
descriptive analytics in Big Data analytics. Some journal articles mention descriptive analytics in their introductory paragraph without contributing to the field.
Table 18.2  Phase I database results for the search query in the abstract and full text of the journal article

ABI/Inform Complete: 342
Academic Search Complete: 304
Business Source Complete: 343
Elsevier (SCOPUS): 398
Emerald: 159
IEEEXplore: 121
Taylor & Francis: 1,035
Association of Information Systems (AIS): 24
Therefore, in the second phase, we searched journal articles with the query term in their metadata, that is, title, abstract and keywords. This phase resulted in 331 papers, as shown in Table 18.3. The results from this phase showed an increase in the use of “descriptive” analytics in publications throughout the years. These trends highlight the need for a field-specific literature review on descriptive analytics in Big Data.
Table 18.3  Phase II database results for the search query in the abstract of the journal article

ABI/Inform Complete: 17
Academic Search Complete: 15
Business Source Complete: 18
Elsevier (SCOPUS): 169
Emerald: 0
IEEEXplore: 104
Taylor & Francis: 0
Association of Information Systems (AIS): 8
In the final phase, we searched the databases in more depth. In this phase we followed several inclusion criteria: publication date limited to journal articles published after 2010; the publishing journal is either A/A* in the ABDC (Australian Business Deans Council) Journal Quality list or Q1 in the SCImago Journal Rank in 2021; the article contributes to the field of descriptive analytics in Big Data; and the article is published in the English language. The third phase resulted in 28 papers.
Table 18.4  The distribution of papers on descriptive analytics from 2010 to 2022

Year:   2010  2011  2012  2013  2014  2015  2016  2017  2018  2019  2020  2021
Papers: 0     0     0     0     0     2     0     3     5     5     5     8
The distribution of the number of papers with respect to years is shown in Table 18.4. The results show that descriptive analytics is gaining more attention in highly ranked scholarly journals. However, no journal has published a review article focused specifically on descriptive analytics. This gap in the research signifies the need for this discussion.
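For readers who want to reproduce this kind of screening step, the sketch below shows how the Phase III inclusion criteria could be applied programmatically to an exported list of search results. The file name and column names are assumptions made for illustration; they do not correspond to an actual data set used in this review.

```python
# Illustrative only: applying the Phase III inclusion criteria to a hypothetical
# export of search results; file and column names are assumptions for this sketch.
import pandas as pd

results = pd.read_csv("search_results.csv")  # e.g. records exported from the databases

screened = results[
    (results["year"] > 2010)
    & (results["language"] == "English")
    & (results["abdc_rank"].isin(["A", "A*"]) | (results["sjr_quartile"] == "Q1"))
    & (results["contributes_to_descriptive_analytics"])  # manual full-text judgement flag
]

# Remove duplicate records returned by more than one database.
screened = screened.drop_duplicates(subset="doi")
print(len(screened), "papers retained for review")
```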
4. ANALYSIS OF REVIEWED PAPERS
As mentioned in sections 1 and 2, this discussion gives precedence to descriptive analytics as it is the base of Big Data analytics. Predictive and prescriptive analytics follow from the data or understandings of descriptive analytics. As such, the majority of the research reviewed in this
chapter includes methods for descriptive analytics. In this section, we present and discuss the methods that have been identified in the reviewed papers for descriptive analytics.
4.1  What are the Descriptive Analytics Methods?
The classification of the methods for descriptive analytics that we have identified in the reviewed papers is illustrated in Figure 18.1. Six classifications arose from the reviewed papers: Association Rule, Clustering and Classification, Data Mining, Data Visualisation, Descriptive Statistics and Regression. However, the boundaries between categories sometimes overlap or are not clearly defined in the reviewed papers. As such, we have provided definitions from the literature for the classified methods for descriptive analytics.

Figure 18.1  Classifications of the descriptive analytics methods

Data mining
Data mining is known as a process to extract and identify significant information and gain knowledge from large data sets. Data mining uses statistical, mathematical, artificial intelligence and machine learning techniques to deal with complex and dynamic data (Tang et al., 2017). There are several types of data mining, including pictorial data mining, text mining, social media mining, audio and video mining and web mining, among others. Text mining is a data mining technique used to extract meaningful terms from text documents so that they can be examined using statistical analysis methods (Jun et al., 2015). Sentiment mining is a text mining technique for studying the sentiments of people. To ensure accurate sentiment reviews, fuzzy sets and fuzzy sentiments are used. This analysis can help one understand the tone of a review, for instance, whether a person is showing positive, negative or neutral emotions (Alekh et al., 2021). Data mining techniques include association rule, classification, clustering, and regression. Figure 18.1 illustrates the breakdown of descriptive analytics as identified in the reviewed literature.

Association models
Association models are a descriptive analytic method which is applied to calculate the correlation between items. Association Rule mining and Apriori algorithms are widely used to find association models (Lian et al., 2020). Social network analysis (SNA) is a common Association Rule model that examines social connections between terms (Jun et al., 2015).

Regression
Regression models are applied to identify trends in data (Lian et al., 2020) and to evaluate the significance of variables (Ko and Chang, 2018). Approaches to regression include support vector regression (Mulerikkal et al., 2021), ordinary least squares multiple regression, multiple regression (Wang et al., 2019), and linear regression (Nasir and Sassani, 2021). Although regression models are most common in predictive analytics, they are also seen in descriptive analytics.

Clustering and classification
Clustering seeks to separate objects into similar groups (Jun et al., 2015), based on similarities between objects (Lian et al., 2020), to reveal patterns (Alekh et al., 2021). There are different types of clustering, such as k-means clustering, hierarchical clustering, spectral clustering and k-nearest neighbour clustering (Lian et al., 2020). In k-means clustering, k determines the optimal number of clusters that can be formed from the data. Techniques such as the elbow method or the silhouette method can be applied to determine the optimum number of clusters (Alekh et al., 2021; Jun et al., 2015).
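To make the choice of k concrete, the following minimal sketch applies the silhouette method with scikit-learn on synthetic data; the data and parameter choices are illustrative assumptions, not drawn from any of the reviewed studies.

```python
# Illustrative only: choosing k for k-means with the silhouette method on toy data.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with an (unknown to the analyst) cluster structure.
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # higher means better-separated clusters

best_k = max(scores, key=scores.get)
print(scores)
print("Best k by silhouette:", best_k)
```

The elbow method works similarly, except that the within-cluster sum of squares (inertia) is plotted against k and the "elbow" of the curve is chosen.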
The classification model supports data mining of unstructured data into groups or classes (Lian et al., 2020). Research identifies that tree-based techniques, such as Decision Tree and Random Forest, are commonly applied classification models. In the Random Forest approach, randomisation builds many trees to form a forest, and these trees can be constructed via a classification and regression tree, also known as CART. Although classification models are often used in predictive analytics (Lian et al., 2020), they are also applied in descriptive analytics. Other classification models include the Support Vector Machine (SVM), the Naïve Bayes classifier (NBC) and the Artificial Neural Network (ANN) (Alekh et al., 2021). SVM algorithms are used to classify binary data and are applied to multi-class classification problems (Alekh et al., 2021). NBC is used to classify data into predefined classes and to determine the probability of new data items belonging to every class; the final class is the one with the highest probability (Alekh et al., 2021). The ANN is a supervised learning method in which data are separated into different classes by identifying the common features between the classes (Wang et al., 2019).

Data visualisation
Data analytics is important, and equally important is how the analysed data is presented and displayed to a decision-maker (Tang et al., 2017). Visualisation in Big Data is a critical component, as it can lead to incisive insights about data that can support decision-making and planning (Mulerikkal et al., 2021; Ko and Chang, 2018). Data visualisation software has the capability to manage large datasets. This software allows users to create interactive dashboards easily, and to connect to more data such as SQL databases, spreadsheets, and cloud platforms (Ko and Chang, 2018). The most common data visualisation tools in the reviewed literature include Tableau (Tang et al., 2017; Ko and Chang, 2018), MapReduce (Mehta and Pandit, 2018), Hadoop (Mehta and Pandit, 2018), and Qlikview (van Rijmenam et al., 2019).

Descriptive statistics
Descriptive statistics is a key component of descriptive analytics, which helps to answer the question of “what” happened (Appelbaum et al., 2017). Descriptive statistics techniques include measures of frequency (i.e., count, percentage, and frequency), central tendency (mean, median, and mode), dispersion or variation (range, variance, standard deviation), and so on. The output of descriptive statistics is visualised using data visualisation techniques and tools.

A majority of the reviewed research combines different methods to provide a solution to the research problem. This shows that the same method can be used to solve various research problems. Table 18.5 depicts the categories of methods and the research papers that have applied or reviewed each method. In the reviewed literature, the most commonly applied descriptive analytics methods are clustering/classification models and descriptive statistics supported by data visualisations. Some reviewed papers have discussed or applied more than one category, as those papers combine different methods in their discussion.
Table 18.5  Classification of papers according to descriptive analytics methods

Association Rule Model (2): Jun et al., 2015; Lian et al., 2020
Clustering and classification models (10): Jun et al., 2015; Gahar et al., 2019; Alekh et al., 2021; Appelbaum et al., 2017; Mulerikkal et al., 2021; Sharma and Joshi, 2020; Lock et al., 2021; Wang et al., 2019; Zhang et al., 2021; Nasir and Sassani, 2021
Data mining (4): Jun et al., 2015; Appelbaum et al., 2017; Chae, 2015; Alekh et al., 2021
Data visualisations (6): Alekh et al., 2021; Appelbaum et al., 2017; Ko and Chang, 2018; Aryal et al., 2020; Lock et al., 2021; Mulerikkal et al., 2021
Descriptive statistics (8): Appelbaum et al., 2017; Chae, 2015; Ko and Chang, 2018; Jiménez et al., 2019; Yacchirema et al., 2018; Jun et al., 2015; Mulerikkal et al., 2021; Aryal et al., 2020
Pattern Recognition (2): Mulerikkal et al., 2021; Zhang et al., 2021
Regression (5): Jun et al., 2015; Mulerikkal et al., 2021; Wang et al., 2019; Ko and Chang, 2018; Jiménez et al., 2019
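As an illustration of the Association Rule category in Table 18.5, the sketch below mines frequent itemsets and rules with the Apriori algorithm using the open-source mlxtend library; the basket data are invented for the example and the support and confidence thresholds are arbitrary.

```python
# Illustrative only: Association Rule mining with the Apriori algorithm (mlxtend)
# on a hypothetical one-hot encoded basket data set.
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

baskets = pd.DataFrame(
    [
        {"bread": 1, "milk": 1, "butter": 0, "beer": 0},
        {"bread": 1, "milk": 1, "butter": 1, "beer": 0},
        {"bread": 0, "milk": 1, "butter": 0, "beer": 1},
        {"bread": 1, "milk": 0, "butter": 1, "beer": 0},
        {"bread": 1, "milk": 1, "butter": 1, "beer": 1},
    ]
).astype(bool)

# Frequent itemsets appearing in at least 40% of the baskets.
frequent = apriori(baskets, min_support=0.4, use_colnames=True)

# Rules such as {bread} -> {milk}, filtered by confidence.
rules = association_rules(frequent, metric="confidence", min_threshold=0.7)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```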
(Al-Sai et al., 2020; Gahar et al., 2019; Lepenioti et al., 2021;
IEEE Access
(Mulerikkal et al., 2021) (Wang et al., 2019) (Appelbaum et al., 2017) (Nasir and Sassani, 2021) (Alekh et al., 2021) (Mehta and Pandit, 2018; Ko and Chang, 2018) (Xu et al., 2021)
IEEE Transactions on Intelligent Transportation Systems
IEEE Transactions on Smart Grid
International Journal of Accounting Information Systems
International Journal of Advanced Manufacturing Technology
International Journal of Contemporary Hospitality Management
International Journal of Medical Informatics
International Journal of Physical Distribution & Logistics
(Jiménez et al., 2019)
(Swaminathan, 2018)
Production and Operations Management
Transportation Research Part D: Transport and Environment
(van Rijmenam et al., 2019)
Journal of Management Information Systems
Long Range Planning (Aryal et al., 2020)
(Grover et al., 2018)
Journal of Intelligent Manufacturing
(Zhang et al., 2021)
(Lee and Chien, 2020)
Journal of Humanitarian Logistics and Supply Chain Management
Sustainability (Switzerland)
(Sharma and Joshi, 2020)
Journal of Computer Information Systems
Supply Chain Management
(Chae, 2015) (Bedeley et al., 2018)
International Journal of Production Economics
Management
(Stahl et al., 2021)
IEEE Transactions on Engineering Management
Yacchirema et al., 2018)
(Jun et al., 2015)
Emerging Markets Finance and Trade
1
(Lock et al., 2021) (Sheng et al., 2021)
Big Earth Data
British Journal of Management
(Tang et al., 2017)
Behaviour and Information Technology
1
1
1
1
1
1
1
1
1
1
1
2
1
1
1
1
1
1
4
1
1
1 1
(Lian et al., 2020)
Count
References
Journal name
Classification of papers according to the publication journal with a ranking of A/A* in the ABDC Quality list or Q1 in the SCImago Rank in 2021
Accident Analysis and Prevention
Table 18.6
Table 18.6 presents the classification of the reviewed literature according to publication journal. All reviewed journals in this chapter have a ranking of A/A* in the ABDC (Australian Business Deans Council) Journal Quality list or Q1 in the SCImago Journal Rank in 2021. The majority of the reviewed journals are published by the Institute of Electrical and Electronics Engineers (IEEE) society. As presented in Table 18.7, the reviewed literature also comes from journals publishing in information technology, science/medicine, management, finance/accounting, supply chain management and transportation.
Table 18.7  Classification of papers according to the application of descriptive analytics methods

Accounting/Auditing (2): Tang et al., 2017; Appelbaum et al., 2017
Business strategy and analytics (3): Bedeley et al., 2018; Grover et al., 2018; van Rijmenam et al., 2019
Energy and Power (1): Wang et al., 2019
Healthcare (3): Yacchirema et al., 2018; Mehta and Pandit, 2018; Ko and Chang, 2018
Humanitarian operations (2): Swaminathan, 2018; Sharma and Joshi, 2020
Marketing (1): Jun et al., 2015
Supply Chain Management (SCM, manufacturing) (6): Stahl et al., 2021; Chae, 2015; Xu et al., 2021; Zhang et al., 2021; Aryal et al., 2020; Lee and Chien, 2020
Tourism (1): Alekh et al., 2021
Transport (4): Mulerikkal et al., 2021; Lian et al., 2020; Jiménez et al., 2019; Lock et al., 2021
Education/Research (5): Nasir and Sassani, 2021; Lepenioti et al., 2020; Al-Sai et al., 2020; Gahar et al., 2019; Sheng et al., 2021
4.2  The Benefits and Challenges of Descriptive Analytic Methods in BDA
As highlighted in the previous sections, descriptive analytics has been applied to various industries. Although organisations are quick to adapt BDA in decision-making, there are both benefits and challenges to Big Data technologies. Research shows that descriptive analytics have the capability to support various activities in the value chain of an organisation. Bedeley et al. (2018) discuss the descriptive analytic capabilities to support firm infrastructure (i.e., financial analytics for better visibility factors influencing revenue and cost); human resource management (i.e., dashboards and scorecards); technology development (i.e., heat maps to determine potential technology issues across the organisation); procurement (i.e., interactive visualisation analytics). Descriptive analytic techniques such as interactive visualisations can be used for inbound logistics, pattern recognition for performance improvement, cluster analysis for network design, and speech/ social media/ web mining for customer requirement recognition. In supply chain planning (SCP) Big Data can be applied for demand planning and fulfilment, purchasing and material requirement planning, distribution and transport planning,
production planning and scheduling (Xu et al., 2021). According to Xu et al. (2021), Big Data in SCP has three roles: (1) supportive facilitator; (2) source of empowerment; and (3) game-changer. Descriptive analytics can support organisations in coping with a high volume of data, and present those data in dashboards and data visualisation solutions. Big Data analytics that generates descriptive data allows consistency in viewing data and enhances the visibility of an organisation’s business processes and outcomes (Grover et al., 2018). For example, dashboards or interactive visualisations can be used to provide real-time data on firm activities. BDA supports organisational value creation by improving the quality of decision-making, improving the efficiency of business processes, and supporting innovation and customer experiences. Furthermore, descriptive analytics supports organisations in understanding their business context, that is, sensing (van Rijmenam et al., 2019). Many organisations apply BDA with descriptive analytics to sense changes in the business environment and understand customer needs. Descriptive analytics is also adopted to explore markets and understand technologies, customers, suppliers and competitors. This chapter proposes that descriptive analytics allows organisations to sense opportunities in times of uncertainty and helps organisations respond to changes in the environment. Although firms have invested in Big Data analytics, only a small percentage of firms are efficiently applying and using analytics (Grover et al., 2018). Most firms struggle to identify the systems they need to capture the data that they require. For these firms, the biggest challenge is not technology or the availability of data, but the human capital required to strategise Big Data analytics (Grover et al., 2018). In healthcare research, descriptive analytics can support the analysis of imaging results and laboratory reports (Mehta and Pandit, 2018). Cluster analysis can be applied in healthcare to identify high-risk groups. Visualisation analytics such as graphs can support the analysis of healthcare service performance based on various quality measures. However, there are challenges to Big Data (and descriptive analytics) in healthcare, such as the integration of structured, semi-structured and unstructured data from various sources. This can result in data inaccuracy, inconsistency, standardisation issues and increased costs. Governance issues can result in patient privacy and confidentiality leaks (Mehta and Pandit, 2018). Lee and Chien (2020) elaborate on common challenges in BDA. How to collect data? How much data is needed? How to merge data from different sources? Other challenges include identifying protocols for missing values in data sets, identifying important variables in data, and determining the reliability of the conclusions derived from descriptive analytics. Despite the challenges, scholars have discussed significant developments in science and health using Big Data technologies. For instance, Yacchirema et al. (2018) propose a system focused on the healthcare of older adults which supports real-time monitoring of obstructive sleep apnoea (OSA) and guides their treatment plans. This system takes advantage of Big Data tools to perform descriptive analysis of data to understand the health evolution of people with sleep apnoea. Likewise, Lian et al.
(2020) discuss the opportunities for descriptive analytics in Big Data safety analytics. They note that descriptive analytics in Big Data can be used for crash detection, the discovery of factors contributing to crashes, driving behaviour analysis and crash hotspot detection. Techniques such as classification, regression, association, clustering, and visualisation models have been used in safety analytics. A summary of the key benefits and challenges identified in the reviewed literature is presented in Table 18.8.
Table 18.8  Summary of benefits and challenges of descriptive analytics in BDA

Benefits:
- Capability to support organisational value chain activities, such as financial analytics, dashboards for human resource management, heat maps for sales revenue, interactive visualisations for marketing.
- Descriptive analytic methods (i.e., visualisations, pattern recognition, cluster analysis) can support logistic management, customer requirement recognition, marketing and sales efficiency.
- Useful for safety analytics in transport, i.e., crash detection, driving behaviour analysis, crash hotspot detection.
- Descriptive analytic methods can support better data viewing of business processes and outcomes.
- Businesses can conduct stakeholder analysis using descriptive analytics, to support them in decision making.
- Advance decision-making in healthcare and science to support patient monitoring, understanding evolution of diseases, causes of illnesses, identifying high-risk groups and quality measures.

Challenges:
- Integration of structured, unstructured and semi-structured data from various sources can lead to standardisation issues.
- Assessing the reliability of data: is the data accurate? Is it consistent or are values missing?
- Issues of privacy and confidentiality in handling data without proper systems of governance.
- The costs of handling large amounts of data.
- Firms struggling to determine the systems they need to conduct data collection and data analytics.
- Issues with human talent, with regard to firms having capabilities to analyse the data in a way that supports business process improvement.
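To make these descriptive methods concrete, the short Python sketch below illustrates the kind of data profiling that typically precedes a dashboard: summary statistics, missing-value counts and a simple plausibility check. It is an illustration only, not a procedure from the studies reviewed above; the file name (orders.csv) and the column delivery_days are hypothetical.

```python
# A minimal profiling sketch, assuming a hypothetical CSV extract of
# business-process data. The file path and column names are illustrative.
import pandas as pd

df = pd.read_csv("orders.csv")  # hypothetical operational data extract

# Descriptive statistics (mean, standard deviation, quartiles) for numeric fields
summary = df.describe()

# Share of missing values per column: a first answer to "is the data consistent
# or are values missing?" (cf. Lee and Chien, 2020)
missing_share = df.isna().mean().sort_values(ascending=False)

# Simple plausibility check on one illustrative field
implausible = (df["delivery_days"] < 0) | (df["delivery_days"] > 90)

print(summary)
print("Share of missing values per column:\n", missing_share)
print(f"{int(implausible.sum())} rows with implausible delivery times")
```

Outputs of this kind are what typically feed the dashboards and interactive visualisations discussed earlier in the chapter.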
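Similarly, the cluster analysis mentioned in the healthcare discussion (identifying high-risk groups) can be sketched with k-means. This is a hedged illustration rather than the method used in the cited studies; the feature names, file path and choice of four clusters are assumptions made for the example.

```python
# A hedged k-means sketch for grouping patients into candidate risk segments.
# Feature names, file path and the number of clusters are illustrative.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

patients = pd.read_csv("patients.csv")  # hypothetical de-identified records
feature_cols = ["age", "bmi", "systolic_bp", "admissions_last_year"]

# Standardise so that no single indicator dominates the distance calculation
X = StandardScaler().fit_transform(patients[feature_cols])

# Partition patients into four groups; in practice k would be chosen with the
# elbow or silhouette methods referred to elsewhere in this Handbook
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
patients["cluster"] = kmeans.labels_

# Descriptive view of each cluster: the segment with the highest mean values
# on the risk indicators is a candidate high-risk group for closer review
print(patients.groupby("cluster")[feature_cols].mean())
```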
5. FUTURE RESEARCH DIRECTION FOR BDA USING DESCRIPTIVE ANALYTICS
We propose three key research directions.

First, in relation to deep learning, we join Nasir and Sassani (2021) in calling for a more sustained and comprehensive empirical examination of deep learning concepts vis-à-vis traditional learning models. The use of data fusion methods and complex hybrid models could be examined at different levels, helping researchers to address the problematic differences between laboratory and real-world conditions. Incremental and transfer learning could play a critical role in reducing those differences.

Second, with regard to the methods adopted in data analytics research, creative research designs in both quantitative and qualitative approaches could further develop our understanding of the productive use of data analysis. Moving beyond annual reports and other secondary-data-based studies, the use of multiple data sources and methods over a longer period of time could provide robust and highly credible results.

Third, researchers could make use of the array of problem-based research opportunities on offer. For example, the sustainability agenda has crept into most firms' strategic decision-making processes. Descriptive analytics could be used to investigate how consumer attitudes influence businesses before and after the adoption of key sustainability measures. Similarly, the COVID-19 pandemic, as an exogenous shock to most firms and industries, has had an unparalleled impact on operations, reputation and revenue. Descriptive analytics could play a vital role in mapping these changes and predicting patterns of reaction among different firms in different industries, across nations and over time. Further, using advanced analytics methods, researchers could address the gap in our understanding of wholesale clients' role in information sharing and demand forecasting across supply chains. Innovations in data-driven approaches to supply chain management have mainly been taken up by B2C firms (Stahl et al., 2021). Different descriptive (advanced) analytics methods could be examined in both research and practice.
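As a simple illustration of the before-and-after comparisons proposed above, the sketch below contrasts descriptive indicators prior to and after an intervention such as the adoption of a sustainability measure. The file name, column names and adoption date are hypothetical, chosen only to show the shape of such an analysis.

```python
# A minimal before/after descriptive comparison, assuming a hypothetical
# monthly panel with revenue and satisfaction columns. All names and the
# adoption date are illustrative.
import pandas as pd

sales = pd.read_csv("monthly_sales.csv", parse_dates=["month"])
adoption_date = pd.Timestamp("2020-01-01")  # hypothetical adoption date

sales["period"] = (sales["month"] >= adoption_date).map({False: "before", True: "after"})

# Compare central tendency and spread before and after the intervention
print(sales.groupby("period")[["revenue", "customer_satisfaction"]]
           .agg(["mean", "median", "std"]))
```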
6. CHAPTER SUMMARY
This chapter identifies that, despite the trends and interest in Big Data, scholarly critique of Big Data analytics is still lagging, especially in relation to descriptive analytics. This review suggests potential benefits, but also challenges, of Big Data analytics. Descriptive analytics, although often employed by practitioners in the business world, is seldom examined in academic journals. This chapter draws attention to the importance of descriptive analytics for Big Data analytics, its existing limitations, and future opportunities for academics and practitioners.
REFERENCES Al-Sai, Z.A., Abdullah, R. and Husin, M.H. (2020) Critical success factors for big data: A systematic literature review. IEEE Access 8: 118940–56. Alekh, G., Aggarwal, S. and Erdem, M. (2021) Reading between the lines: Analyzing online reviews by using a multi-method Web-analytics approach. International Journal of Contemporary Hospitality Management 33: 490–512. Appelbaum, D., Kogan, A., Vasarhelyi, M. and Yan, Z. (2017) Impact of business analytics and enterprise systems on managerial accounting. International Journal of Accounting Information Systems 25: 29–44. Aryal, A., Liao, Y., Nattuthurai, P. and Li, B. (2020) The emerging big data analytics and IoT in supply chain management: A systematic review. Supply Chain Management 25: 141–56. Babiceanu, R.F. and Seker, R. (2016) Big Data and virtualization for manufacturing cyber-physical systems: A survey of the current status and future outlook. Computers in Industry 81: 128–37. Bedeley, R.T., Ghoshal, T., Iyer, L.S. and Bhadury, J. (2018) Business analytics and organizational value chains: A relational mapping. Journal of Computer Information Systems 58: 151–61. Boyd, D. and Crawford, K. (2012) Critical questions for big data: Provocations for a cultural, technological, and scholarly phenomenon. Information, Communication & Society 15: 662–79. Buchmüller, C., Hemsen, H., Markl, V., Schermann, M., Bitter, T., Hoeren, T. and Krcmar, H. (2014) Big Data: An interdisciplinary opportunity for information systems research. Business & Information Systems Engineering 6(5): 261–6. Chae, B. (2015) Insights from hashtag #supplychain and Twitter Analytics: Considering Twitter and Twitter data for supply chain practice and research. International Journal of Production Economics 165: 247–59. Duan, L. and Xiong, Y. (2015) Big data analytics and business analytics. Journal of Management Analytics 2: 1–21. Fahad, A., Alshatri, N., Tari, Z., Alamri, A., Khalil, I., Zomaya, A.Y., Foufou, S. et al. (2014) A survey of clustering algorithms for big data: Taxonomy and empirical analysis. IEEE Transactions on Emerging Topics in Computing 2: 267–79. Fan, J., Han, F. and Liu, H. (2014) Challenges of big data analysis. National Science Review 1: 293–314. Gahar, R.M., Arfaoui, O., Hidri, M.S. and Hadj-Alouane, N.B. (2019) A distributed approach for high-dimensionality heterogeneous data reduction. IEEE Access 7: 151006–1022. Greasley, A. and Edwards, J.S. (2021) Enhancing discrete-event simulation with big data analytics: A review. Journal of the Operational Research Society 72: 247–67. Grover, V., Chiang, R.H.L., Liang, T.-P., and Zhang, D. (2018) Creating strategic business value from big data analytics: A research framework. Journal of Management Information Systems 35: 388–423.
Descriptive analytics methods in big data 307 Gunasekaran, A., Papadopoulos, T., Dubey, R., Fosso Wamba, S., Childe, S.J., Hazen, B. and Akter, S. (2017) Big data and predictive analytics for supply chain and organizational performance. Journal of Business Research 70: 308–17. Günther, W.A., Mehrizi, M.H.R., Huysman, M. and Feldberg, F. (2017) Debating big data: A literature review on realizing value from big data. The Journal of Strategic Information Systems 26: 191–209. Jiménez, J.L., Valido, J. and Molden, N. (2019) The drivers behind differences between official and actual vehicle efficiency and CO2 emissions. Transportation Research Part D: Transport and Environment 67: 628–41. Jun, S., Park, S. and Jang, D. (2015) A technology valuation model using quantitative patent analysis: A case study of technology transfer in big data marketing. Emerging Markets Finance and Trade 51: 963–74. Kitchin, R. and McArdle, G. (2016) What makes Big Data, Big Data? Exploring the ontological characteristics of 26 datasets. Big Data & Society 3: 2053951716631130. Ko, I. and Chang, H. (2018) Interactive data visualization based on conventional statistical findings for antihypertensive prescriptions using National Health Insurance claims data. International Journal of Medical Informatics 116: 1–8. Labrinidis, A. and Jagadish, H.V. (2012) Challenges and opportunities with big data. Proceedings of the VLDB Endowment 5: 2032–3. Larson, D. and Chang, V. (2016) A review and future direction of agile, business intelligence, analytics and data science. International Journal of Information Management 36: 700–10. Lee, C.-Y. and Chien, C.-F. (2020) Pitfalls and protocols of data science in manufacturing practice. Journal of Intelligent Manufacturing 1–19. Lepenioti, K., Bousdekis, A., Apostolou, D. and Mentzas, G. (2020) Prescriptive analytics: Literature review and research challenges. International Journal of Information Management 50: 57–70. Lepenioti, K., Bousdekis, A., Apostolou, D. and Mentzas, G. (2021) Human-augmented prescriptive analytics with interactive multi-objective reinforcement learning. IEEE Access 9: 100677–93. Lian, Y., Zhang, G., Lee, J. and Huang, H. (2020) Review on big data applications in safety research of intelligent transportation systems and connected/automated vehicles. Accident Analysis & Prevention 146. Lock, O., Bednarz, T. and Pettit, C. (2021) The visual analytics of big, open public transport data: A framework and pipeline for monitoring system performance in Greater Sydney. Big Earth Data 5: 134–59. Martinez, E.C., Cristaldi, M.D. and Grau, R.J. (2009) Design of dynamic experiments in modeling for optimization of batch processes. Industrial & Engineering Chemistry Research 48: 3453–65. Mehta, N. and Pandit, A. (2018) Concurrence of big data analytics and healthcare: A systematic review. International Journal of Medical Informatics 114: 57–65. Mulerikkal, J., Thandassery S., Dixon K., D.M., Rejathalal, V. and Ayyappan, B. (2021) JP-DAP: An intelligent data analytics platform for metro rail transport systems. IEEE Transactions on Intelligent Transportation Systems. Najafabadi, M.M., Villanustre, F., Khoshgoftaar, T.M., Seliya, N. and Muharemagic, E. (2015) Deep learning applications and challenges in big data analytics. Journal of Big Data 2: 1–21. Nasir, V. and Sassani, F. (2021) A review on deep learning in machining and tool monitoring: Methods, opportunities, and challenges. International Journal of Advanced Manufacturing Technology 115: 2683–709. 
Nguyen, T., Li, Z., Spiegler, V., Ieromonachou, P. and Lin, Y. (2018) Big data analytics in supply chain management: A state-of-the-art literature review. Computers & Operations Research 98: 254–64. Ross, J.W., Beath, C.M. and Quaadgras, A. (2013) You may not need big data after all. Harvard Business Review 91. Sharma, P. and Joshi, A. (2020) Challenges of using big data for humanitarian relief: Lessons from the literature. Journal of Humanitarian Logistics and Supply Chain Management 10: 423–46. Sheng, J., Amankwah‐Amoah, J., Khan, Z. and Wang, X. (2021) COVID‐19 pandemic in the new era of big data analytics: Methodological innovations and future research directions. British Journal of Management 32: 1164–83.
308 Handbook of big data research methods Soltanpoor, R. and Sellis, T. (2016). Prescriptive analytics for big data. In M.A. Cheema, W. Zhang and L. Chang (eds), Databases Theory and Applications. Paper presented at the 27th Australasian Database Conference: ADC 2016. Sydney: Springer International Publishing, pp. 245–56. Stahl, C., Stein, N. and Flath, C.M. (2021) Analytics applications in fashion supply chain management: A review of literature and practice. IEEE Transactions on Engineering Management. Sun, G.-D., Wu, Y.-C., Liang, R.-H. and Liu, S.X. (2013) A survey of visual analytics techniques and applications: State-of-the-art research and future challenges. Journal of Computer Science and Technology 28: 852–67. Sun, Z., Sun, L. and Strang, K. (2018) Big data analytics services for enhancing business intelligence. Journal of Computer Information Systems 58: 162–9. Swaminathan, J.M. (2018) Big data analytics for rapid, impactful, sustained, and efficient (RISE) humanitarian operations. Production and Operations Management 27: 1696–700. Tabesh, P., Mousavidin, E. and Hasani, S. (2019) Implementing big data strategies: A managerial perspective. Business Horizons 62: 347–58. Tang, F., Norman, C.S. and Vendrzyk, V.P. (2017) Exploring perceptions of data analytics in the internal audit function. Behaviour and Information Technology 36: 1125–36. Tranfield, D., Denyer, D. and Smart, P. (2003) Towards a methodology for developing evidence‐ informed management knowledge by means of systematic review. British journal of management 14: 207–22. Tsai, C.-W., Lai, C.-F., Chao, H.-C. and Vasilakos, A.V. (2015) Big data analytics: A survey. Journal of Big Data 2: 1–32. Van Rijmenam, M., Erekhinskaya, T., Schweitzer, J. and Williams, M.A. (2019) Avoid being the Turkey: How big data analytics changes the game of strategy in times of ambiguity and uncertainty. Long Range Planning 52. Wamba, S.F., Akter, S., Edwards, A., Chopin, G. and Gnanzou, D. (2015) How ‘big data’ can make big impact: Findings from a systematic review and a longitudinal case study. International Journal of Production Economics 165: 234–46. Wang, Y., Chen, Q., Hong, T. and Kang, C. (2019) Review of smart meter data analytics: Applications, methodologies, and challenges. IEEE Transactions on Smart Grid 10: 3125–48. White, M. (2012) Digital workplaces: Vision and reality. Business Information Review 29: 205–14. Xu, J., Margherita Emma Paola, P., Ciccullo, F. and Sianesi, A. (2021) On relating big data analytics to supply chain planning: Towards a research agenda. International Journal of Physical Distribution & Logistics Management 51: 656–82. Yacchirema, D.C., Sarabia-Jacome, D., Palau, C.E. and Esteve, M. (2018) A smart system for sleep monitoring by integrating IoT with big data analytics. IEEE Access 6: 35988–6001. Zhang, S., Huang, K. and Yuan, Y. (2021) Spare parts inventory management: A literature review. Sustainability (Switzerland) 13: 1–23. Zhou, Z.-H., Chawla, N.V., Jin, Y. and Williams, G.J. (2014) Big data opportunities and challenges: Discussions from data analytics perspectives [discussion forum]. IEEE Computational Intelligence Magazine 9: 62–74.
Index
Abbass, H. 149–50 Accelerated Data Science (ADS) SDK 166 Accenture 202 accounting 32–47 financial accounting 43–4 importance 39–40 management accounting 44–5 Adalasso 22 Adämmer, P. 15 aggregation of data 4, 78, 90 Agrawal, A. 15 Ahmad, A. 74 Albert AI platform 81 algorithms 5, 25, 37, 39, 88 Alibaba 3 AlphaGo 153 Alteryx 87 Altman, E.I. 12 Aluri, A. 74 Amazon 1, 3, 45, 80, 152, 158, 165, 166, 172, 181, 183, 202, 279 Alexa 76 Athena 176 Australia 185 Cloud 166 EC2 Spot Instances 173 EMR 176 Glue 176 SageMaker 166–7, 174 American Accounting Association (AAA) 37 American Express 181 American Institute of Certified Public Accountants (AICPA) 37 Assurance Services Executive Committee 43 analysis of data 4–5, 6, 98–9, 109–111, 149, 152 analysis of variance 98 analytics 37, 75, 172, 279 Andersen, M.K. 187 anomaly detection 166 Apex Parks 223 Appelbaum, D. 91 Apple 39, 82, 152, 279 Application Programming Interface (API) 242 approximation (sampling or lines embedding) methods 133 a priori algorithm 159–60, 299 Argonne National Laboratory 120 Ari, A. 19
ARIMA (AutoRegressive Integrated Moving Average) model 15, 22, 105, 108 Arnaboldi, M. 45 artificial intelligence (AI) accounting 33–6 data science ecosystem using public cloud 165, 169, 174, 176 data-driven analytics (DDA) and HRM 195, 202 descriptive analytics 296, 299 e-commerce 86–7, 94, 96, 99 fashion retailing 73–4, 76, 79–80, 81 financial prediction 11, 13, 22–3, 26 GPC for psychophysical detection tasks using transfer learning 143 HR analytics 181–3, 185, 188, 189 predictive analytics and decision-making 117, 120, 124 predictive analytics for machine learning and deep learning 148–50, 152, 154, 160, 162 see also customer relationship management (CRM) systems and artificial intelligence (AI) artificial intelligence as a service (AlaaS) 176 artificial neural networks (ANN) 12, 18, 22–3, 74, 300–301 Arunachalam, H.B. 161 Arya 183 Association of Chartered Certified Accountants (ACCA) 32, 35 association rules 5, 74, 78, 90–91, 92, 158–60, 299, 304 descriptive analytics 299, 301 Association to Advance Collegiate Schools of Business (AACSB) 37 AT&T 290 Athey, S. 12 auditing 42–3, 188–9 Auditing Standards Board Task Force 43 Augmented Reality 76 Australia 15, 168, 176, 276 Broadcasting Corporation (ABC) 185 auto-correlation 114 auto-scaling 166 automation 79, 99, 200–203, 205, 207, 267 AutoML 169, 174 Autopilot function 166 average 95, 96, 98, 112
309
310 Handbook of big data research methods Avianca 223 Awasthi, P. 289 Axis bank 223 Azimi, M. 15 Azure 166, 172 backpropagation algorithm 18, 154 BambooHR 201 Bank of America 195 Bansal, B. 94 Bansal, S. 94 Barbour, D.L. 142 barcodes 43 Bastos, J.A. 19 Bayes rule 132, 135, 136 Bayesian Active Learning by Disagreement (BALD) 142 Bayesian extreme learning machine (BELM) 22 Bayesian inference 133–4 Bayesian interrupted time-series (BITS) modeling for intervention analysis 105–114 analysis of data 109–111 causal impact findings 112 change detection 113 extraction of data 109 findings 111–13 intervention date 111 Life-IP Clean Air Policy case study (Bulgaria) 106, 108, 114 policy intervention analysis on sulphur dioxide level 111–13 software tools 110–111 Bayesian modeling 130, 132, 166, 297 BBC 189 BDO 181 Bear Stearns 223 Bedeley, R.T. 303 Belgium 22 Bell Labs 154 Bellotti, T. 19 Bellovary, J.L. 17 benchmark random walk model 15 benchmarking 62, 202 Beyca, O.F. 23 Bianchi, D. 14 bias confirmation and unconscious 188 data snooping 13 ethnicity 81–2, 182–3, 189 gender 82, 182–4, 188–9, 202, 207 human 183 minimisation 189 omitted variable 11, 109 recruitment 183 restriction 133
sample selection 11, 13 subjective 182, 184, 188 biased algorithms 81–2, 181, 184 Big Data Value Chain (BDVA) 72 big open linked data (BOLD) 1 binary probit benchmark model 15 bipartite networks 223, 224–5, 227 block modeling 222 blockchain technology 34, 40–41, 185 BNSF 120 Bondarouk, T. 196 Boolean operators (AND, OR) 75, 139 Boudreau, J.W. 180 box plots 90, 168 breach of data 267, 272, 274 Breezy HR 183 Bresnick, J. 161 Brownlee, J. 152 Buehlmaier, M.M. 15, 25 business intelligence tools 41, 91 business rules 39 Canada 121–2, 124 Canopy 166–7, 170 Cao, S. 25 capital asset pricing model 155 Capital One 1, 181 Carmona, P. 18 Carrefour 290 categorization 77, 88, 95, 99 CausalImpact R package 110–111 central tendency 91, 96 chain rule 154 Chang, Y. 88 change point 110 Chartered Global Management Accountant (CGMA) 32–3 charts 77, 78, 91–2, 95, 112, 296 bar 90 coxcomb 90 pie 90 scatter 90 chatbots 76, 79, 81, 176 Chatterjee, S. 279, 287 Chauhan, A.S. 2 Chen, I.J. 286 Chen, S. 19 Chi, N.T.K. 282 Chien, C.-F. 304 China 15, 19, 271 Chiroma, H. 21 choice models 2 churn 74, 78, 123, 160, 279 Clarifai 81
Index 311 classification 18, 77–8, 120–21, 134, 155, 158, 299–303 text 15, 80, 88 see also under Gaussian process classification (GPC) for psychophysical detection tasks using transfer learning classification and regression tree (CART) 300 cleansing/cleaning of data 4, 6, 34, 77, 90, 168, 176 cloud platforms 4, 173, 279, 301 see also data science ecosystem using public cloud clustering descriptive analytics 299–303, 304 descriptive analytics and data visualization in e-commerce 87, 88, 91, 93, 95, 97, 98, 99 fashion retailing 74, 78, 79–80 hierarchical 222, 299 k-means 74, 299–300 k-nearest neighbour 19, 161, 300 predictive analytics and decision-making 117 predictive analytics for machine learning and deep learning 158 social network analysis using design science 225 spectral 299 clustering coefficient 218–19, 223 CMX Cinemas 223 cognitive computing 88, 131 Cokins, G. 44 collection of data 3–4, 36, 93 BITS modeling for intervention analysis 105 CRM systems and AI 279 data science ecosystem using public cloud 167–8 data-driven analytics (DDA), HRM and organization performance measurement 203 fashion retailing 75, 76 HR analytics 187 Notre-Dame fire and Twitter 242–3 predictive analytics and decision-making 124 predictive analytics for machine learning and deep learning 152 social network analysis (SNA) using design science 224–5 comparative analytics 5 competitive advantage 4, 86, 89, 99 CRM systems and AI 280, 289 data-driven analytics (DDA) and HRM 195–7 predictive analytics for machine learning and deep learning 148–9, 150, 154, 161–2
social network analysis using design science 214 Complex Adaptive Systems (CAS) model analysis 271–4, 276 application to personal data handling 271–2 background 271 complexity 274 definition 271 dynamic system 274 limitations 275 personal data community (PDC) 271, 272–5 systematic system 273 compressed sensing-based denoising (CSD) 22 computer science 75, 149 Con Edison 120 confidence intervals 108, 111–12, 132 confidentiality see privacy and confidentiality contrast code variable 244, 260 cookies or caches 55 coordination issues 187 correlation coefficient 90 correlation of data 62 correlation plots 168 Costa, A. 22 counterfactual values 107, 109, 111, 112 covariance 90, 134 COVID-19 pandemic 93, 95, 223, 290 accounting 40 BITS modeling for intervention analysis 106 descriptive analytics 305 HR analytics 182, 186 personal data protection 267, 274 predictive analytics for machine learning and deep learning 151, 161 Cowan, Z. 268 Cox, M. 142 CPU 166 Crawford 54 credit evaluation model 19 Crook, J. 19 Crunchhr 201 CUMSUM (Cumulative Sum) 110, 111 customer journey mapping 56 Customer Lifetime Value (CLV) 74 Customer Referral Value (CRV) 291 customer relationship management (CRM) 86, 123 customer relationship management (CRM) systems and artificial intelligence (AI) 279–91 AI-based CRM dimensions and sub-dimensions 286–90 defining AI-based CRM 282–3 defining CRM 2802 importance 283–6
312 Handbook of big data research methods processes 284–6 research methodology 280 customization 175–6, 268 cyber-security and cyber attacks 161, 267 Dagoumas, A.A. 23 Dahlbom, P. 180–81 dashboards 89–92, 95–6, 99, 174, 202–4, 206, 301, 303–4 data binge 6 data lakes 172, 174 data science 148, 152, 154, 162 data science ecosystem using public cloud 165–77 benefits 172–4 capacity on demand and cost reduction 173 collaboration 172 governance of data 173 production rollout 173–4 specialized skills not required 174 tracking data 172 challenges 169–72 collaboration 170 compute capacity requirement 171 governance of data 171 model deployment in production environment 171–2 specialized skills requirement 172 tracking data 170 feature engineering 169 financial industry case studies 174–7 training, deployment and monitoring 169 data-driven analytics (DDA), HRM and organization performance measurement 195–207 applications 202 embracing HRA success 201 findings and analysis 203–5 future research 207 HRA experience 200 HRA gauge 200 HRA realm 200 limitations 207 literature review 197–9 evolution of HR analytics 197–8 HR analytics definition and terminologies 198–9 managerial implications 206 ontogeny of HRA 199–200 questionnaire development 203 research methodology 203 sapience analytics HR professionals 200 theoretical implications 206 Datapine 41
Davenport, T.H. 2, 3, 5, 6 De Bruyn, A. 82 De Vries, B. 142 decision science 6, 75 decision trees 18–19, 78, 79, 117, 121, 157–8, 161, 297, 300 decision-making 2, 4, 6, 36, 78, 86–7 see also predictive analytics and decision-making; strategic decision-making Deep Blue 153 deep learning accounting 40 data science ecosystem using public cloud 165–6 descriptive analytics 305 fashion retailing 74, 81 financial prediction 11–13, 15, 16, 19, 21–2 see also predictive analytics for machine learning and deep learning Deloitte 186, 201 Delta Airlines 284 Denmark 142 descriptive analytics 4–5, 38–9, 295–306 accounting 38–9 benefits and challenges 303–5 data-driven analytics (DDA) and HRM 203 descriptive statistics 301–3 e-commerce 87–8, 89–90, 91 fashion retailing 77, 78 future research direction 305–6 reviewed papers analysis 298–305 social network analysis using design science 214 systematic literature review process 297–8 descriptive analytics and data visualization in e-commerce 86–100 applications 92 business statistics 91 Flipkart case study (India) 93–9 findings and analysis 95–8 research design 95 research methodology 95 literature review 89–92 nature of analytics 90 research approach 93 research implications 99–100 types of analytics 90–91 value of descriptive analytics 88–9 design science process models 6 see also social network analysis using design science deviance 129 diagnostic analytics 4, 5, 38–9
Index 313 Dialogflow 176 difference-in-difference analysis 106, 107, 109 Digital India program 93 dimensionality 11–12, 14, 158, 160 disasters see Notre-Dame fire and Twitter discriminant analysis 18 Dixon, M. 55 Dorsey, J. 233 Dreyfus, S. 154 dual-weighted fuzzy proximal 18 Duke University 161 Dwivedi, Y.K. 1 Dynamic Capabilities (DC) theory 181, 206 dynamic factor model 22 e-commerce see descriptive analytics and data visualization in e-commerce education sector 119, 122 efficiency roadmap 201 eigenvector centrality 244 Einstein 360 customer platform 290 ElasticNet 22–3 elbow method 300 electronic health records (EHRs) 118, 121, 123–4 Elula 176–7 Emanuel, J.E. 151 Emerald 75 encompassing least absolute shrinkage and selection operator (E-LASSO)-based model 14 ensemble classifiers 19 ensemble empirical mode decomposition 16 Enterprise Resource Planning (ERP) 35, 44, 86 Erel, I. 12 ESM 108 ethical, moral and legal issues 81–2, 123, 181, 182, 186–8, 189, 270 ethnicity bias 81–2, 182–3, 189 Euler Hermes 223 European Court of Human Rights 269 evidence-based management theory 206 Expectation Propagation (EP) 137 explanatory models 36 extraction of data 4, 109, 149 extrapolative models 36 extreme gradient boosting 18 Facebook 34, 39, 42, 94, 120, 152, 183, 274 Facebook v Ireland (2016) 270 fairness checks 189 Farias, F.V. 149 fashion retailing 72–82 big data analytics 72–3 big data value chain model (BDVC) 75–9 analytics 77–9
exposition and modelling 79 ethical implications 81–2 findings 75 future prospects 79–81 chatbots for customer experience 81 clustering for customer segmentation and discovery 79–80 computer vision for branded object recognition 81 regression models for dynamic pricing 80 robotic processes for marketing operations 81 text classification for user insight and personalization 80 visualization for superior reporting 80 literature review 73–4 research approach 74–5 Fawcett, S.E. 148 Federal Bank India 176 Ferrari, D. 22 field codes 75 financial institutions 118–19, 161 financial prediction 11–26 asset returns 13–17 default risk 17–20 dimension 13 energy finance 21–4 future research directions 24–6 size 13 structure 13 first-party data 68 Fitz-Enz, J. 179, 201 fog computing 267 Forecast combinations 22 Forester Consulting 61 formatting 149 Forth Smart Thailand 175 FORTRAN code 154 Franzen, J. 269 fraud detection 175 frequency 90–91 scale 129 frequentist approach 130, 132 Frow, P. 286 Fuster, A. 19, 25–6 fuzzy logic 74 FWD Know Your Customer (KYC) 175 game theory 78 GAP 73 GapJumpers 189 GARCH 22 García, H.C. 161 Gardner, J.R. 142
314 Handbook of big data research methods Garnett, R. 142 Gartner 34 gathering data 66, 118, 122, 149, 151, 168 Gaussian process classification (GPC) for psychophysical detection tasks using transfer learning 128–43 audiogram estimation 129, 131–2, 141–2 air-conduction 141 historical background and active learning 142 transfer learning 141, 143 USA most active 142 audiology, computational (CA) 128 classification 132–3, 136–8 approximation schemes 137–8 hyperparameters 138 model selection 136–7 predictions 137 Gaussian distribution 132 Gaussian equation 132 Gaussian mixture model 142 machine learning 131–2 protocol, eligibility criteria and search 138–9 psychometric function and audiometry 129–31 errors estimation 131 fitting the function 130 goodness of fit or certainty 131 stimulus level 129–30 types and choice of function 130 regression 133, 135–6 model selection 1335 predictions 136 selection and extraction 139–41 inclusion, exclusion and quality criteria 139 paper and report selection process 139–40 summary of analysed papers 140–41 gender bias 82, 182–4, 188–9, 202, 207 General Data Protection Regulation (GDPR) 269–70 General Electric 181 generalized boosting 18 generalized linear models 14 genetic algorithm (GA) 5, 12, 23, 74, 117 Ghanbari, F. 153 Glass, N. 233 Goldman Sachs 184 Gold’s Gym 223 Goldstein, I. 11, 12 Google 34, 42, 78, 86, 152, 165, 166, 172, 173, 195, 200 Analytics 38, 73 Auto Complete 120
AutoML 175 Cloud 176 Cloud Vertex AI 174 Cloud Vision 175, 177 DeepMind 153 Google v Spain (2014) 270 Instant 120 Oxygen Project 186, 202 Recognic 177 Research 110 Suggests 120 Gopal, M. 150 Gotlieb, C. 268 GPUs 171 Grabit model 18 graphs 77, 90–91, 94–5, 111, 129, 202–3, 220, 296, 304 Greece 223 Green, J. 14 Gregor, S. 226 Griva, A. 74 Groot, P.C. 142 growth hacking techniques 282 Gu, S. 13, 14 Guha, A. 82 Guo, L. 4 Hadoop 301 Haier 3 Han, Y. 14 Hao, X. 22 Harvey, C.R. 14 Hawley, D.D. 12 HDFC bank 223 He, W. 4 healthcare sector 118, 121–3, 124, 161, 304 heat maps 65, 92, 303 Heaton, J.B. 12 Herrera-Flores, B. 152 Hevner, A.R. 226 Hewlett-Packard (HP) 86, 120 High-Impact People Analytics 186 Hillebrand, B. 289 Hinton, G.E. 154 HireVue 183 Hirnschall, C. 12, 18, 19 histograms 90, 168, 296 Ho, H.T. 18 Houlsby, N. 142 HR analytics 92, 120, 179–90 big data defined 180–81 definition and conceptualization 180 employee exploitation 182–5 maintenance and performance appraisal 184–5
Index 315 recruitment and selection 183–4 training, development and career progression 184 literature search techniques 182 well-being and welfare promotion 185–9 algorithm audits and soliciting blind CVs 188–9 confirmation and unconscious biases removal 188 ethical analysis 186–8 inclusive organizational culture 186 HR metrics 180 HSBC 175 Huang, D. 15 Huang, T. 15 Huerta, R.E. 161 Hugo, V. 241 human capital analytics 200 human capital management 206 human capital resource model 206 human resource information systems (HRIS) 195–7, 202 human resource management see data-driven analytics (DDA), HRM and organization performance measurement Humanyze 195 hybrid models 77, 107, 286, 305 hyperplanes 156–7 IBM 149, 150, 180, 184, 185, 202 Blue Match 186 Kenexa 201 Security 267 Smarter Workforce Institute 182 Watson 40, 121, 153 Talent Insights 184 Tone Analyser 80 ICICI Prudential 177, 223 identity theft 267, 275 IDFC First Bank 225 image analytics 122 image recognition 98, 152 impact models 109–110, 113–14 inaccuracy of data 304 inbound logistics 92 independent and identically distributed (IID) random variables 105 independent variables 109, 155 India 279 see also under descriptive analytics and data visualization in e-commerce; and under social network analysis (SNA) using design science inductive methods 133 information flow analysis 222
information theoretic approach 142 inquisitive analytics 4, 5 insight-based data 5–6, 76, 149 instantiation 222, 226 Institute of Chartered Accountants of England and Wales (ICAEW) 42 Institute of Management Accountants (IMA) 32, 35 integration of data 4, 6, 75, 77, 124, 290 Interactive Data Extraction and Analysis (IDEA) 41 International Accounting Standards (IAS) 35 International Financial Reporting Standards (IFRS) 35 Internet of Things (IoT) 120, 161–2, 181, 185, 267 intervention analysis see Bayesian interrupted time-series (BITS) modeling for intervention analysis ITS methodology 113–14 Iworiso, J. 15 Jagadish, H. 6 Jasilionienè, R. 289 Java 172 Jeffery, M. 62 Jobs, C.G. 54 Johns Hopkins University 123 Jones, S. 18 Jupyter Notebook 166, 171 JupyterLab 169 k-means clustering 74, 159, 299–300 k-nearest neighbors (KNN) algorithm 156, 157 k-nearest neighbour clustering 19, 161, 300 Kamiya, S. 12 Kandani, A.E. 12, 18 Kang, S.K. 183 Karnataka Bank 225 Keippold, M. 15 Kelleher, D.J. 149 Kelley, H.J. 154 Kellner, R. 18, 19 Kelly, B. 15 kernel function 134, 135–6, 138 kernel machine 23 kernel trick 135 Key Performance Indicators (KPIs) 38, 59, 78, 92 Kim, D.J. 3 Kim, J. 2, 3, 5, 6 Kingdom, F.A.A. 129 Kotak Mahindra bank 223 KSA (knowledge, skills, attitude) functions 206 Lago, J. 22
316 Handbook of big data research methods LAMP (logic, analytics, measures and process) framework 206 Laplace method 137–8 Latent Dirichlet Allocation (LDA) 244, 259 Lawler III, E.E. 201 learning active 142–3 analytics 119 applied 78 incremental 305 pool-based active 142 reinforced 131–2, 154, 160, 162 reward or punishment-based techniques 160 semi-supervised 88 supervised 88, 131, 133, 134, 151, 154–5, 157, 162, 166 transfer 141, 143, 305 see also Gaussian process classification (GPC) for psychophysical detection tasks using transfer learning unsupervised 88, 90–91, 131, 151, 154, 157, 158–9, 162, 166, 175–6 least absolute shrinkage and selection operator (LASSO)-based model 14–15, 22 LeCun, Y. 154 Lee, C.-Y. 304 Lee, C.K.H. 3, 5 Lehmann Brothers 223 Leigh, T.W. 279 Lessmann, S. 18 LGD estimations 19 Li, J. 22 Li, K. 25 Li, Y. 22 Lian, Y. 304 Libai, B. 287 Life-IP Clean Air Policy case study (Bulgaria) 106, 108, 114 Lilien, G.L. 2 LinkedIn 120, 279 links 216, 219–20, 222–3, 225, 239, 244, 251, 255 Lithuania 289 Liu, Y. 19 logic-based models 79 logistic function 156 logistics companies 161 logit model 18 McCarthy, J. 150 McClelland, J. 153 McCulloch, W. 153 machine learning (ML) 5 accounting 33, 34, 39, 41–2
data science ecosystem using public cloud 165–7, 168–70, 172, 174–7 data-driven analytics (DDA) and HRM 195, 200, 201 descriptive analytics 299 e-commerce 86–7, 88 fashion retailing 74–5, 76–7, 79, 80, 81, 82 financial prediction 11–14, 15, 16, 18, 19, 21–6 GPC for psychophysical detection tasks using transfer learning 128, 143 HR analytics 181, 183, 185 marketing analytics 52 personal data protection 268 predictive analytics and decision-making 117, 120 third-party 42 see also predictive analytics for machine learning and deep learning McKinsey Global Institute 32, 186 Mailchimp Marketing Glossary 52 Maimon, O. 157–8 Manatal 183 manipulative algorithms 184 manufacturing sector 161 maps 95 heat 65, 92, 303 March, S.T. 222 market basket analysis 160 market location principle 270 marketing analytics 52–69 A/B tests 67 analytical tools utilization 64 attribution model selection 62 audiences, relevant or underserved, recognition of 66 benchmark creation 62 capabilities assessment 54–5 comparison and contrast of collected data 57 competition 61 consumer and marketing technology trends 64 correlation of data 62 creative teams benefiting from analytics software 65 current abilities 63 customer data segmentation 57–8 customer interests and patterns 60 customer support analytics 60 data scientists, lack of 62 difficulties 61–2 diversity of methods 54 examination of data 67 features and capabilities of software 63–4 future prospects 61, 66, 68
Index 317 goal-oriented mind-set 57 impacts 56 importance 53, 64–6 insights 65 interest or population segmentation 65 knowing what to measure 62 marketing mix 53 marketing tools 57 measurable assertions 56 media and messaging 60–61 optimization recommendations 64 organic content interaction 67 origins of data 66–9 paid advertisements 67 possibilities 53–9 product development trends 60 product intelligence 60 programme installation 63 quality of data 58, 61, 64 quantity of data 61 recommended practices 59 return on investment 68 skills development 69 software design 63 surveys 66 transformation of data into knowledge 56 user experience enhancement 68 Marketing Measurement and Optimization Platforms 63 Markov Chain Monte Carlo (MCMC) 137 Markov model 297 Markov regime-switching 21 Marler, J.H. 180 Marr, B. 45 Marriott 1 Martinez-Rojas, M. 236 Maryland University 120 Mascio, D.A. 15 mathematical programming 79, 149, 297, 299 Maximum A Posterior (MAP) estimation 130 Maximum Likelihood Estimate (MLE) 130 mean 77, 90–91, 96, 107, 134–6, 243–4, 260–61, 296, 301 media mix models (MMM) 59, 62 median 77, 90–91, 95, 99, 113, 296, 301 median value 40, 96, 97, 98, 113 Melbourne University 188–9 Merrill Lynch 223 metrics 92, 98, 180, 201, 204–5, 206 Mexico University 161 Miah, S.J. 2 microblogging 235, 239, 261–2 Microsoft 162, 173, 184 Microsoft Excel 41, 67, 80, 201 Microsoft Office 365 Workplace Analytics 184
Microsoft Power BI 202 Milunovich, G. 15 MindReduce 301 mining of data 4, 6, 37, 78, 90, 92, 95 CRM systems and AI 282, 289 descriptive analytics 299, 300, 301 e-commerce 86–7, 88, 89 fashion retailing 74 HR analytics 185 predictive analytics and decision-making 117 predictive analytics for machine learning and deep learning 148–9 Miralles-Pechuán, L. 74 misuse of data 39, 56 MIT study 187 mode 77, 90, 91, 296, 301 model overfit and false discovery 14 model testing 3–4 modularity 219–20, 223, 227 module discovery problem 227 Monday.com 183 moneycontrol.com 224 Moneytree 167 Mullainathan, S. 12 Multi-Touch Attribution (MTA) 59, 62 multidimensional linear regression model 243, 259, 260, 263 multivariate adaptive regression spines 16 multivariate model 111 Murphy, K.P. 151 Naïve Bayes 15, 78, 79, 300 Najafzadeh, S. 153 Nasir, V. 305 natural language processing (NLP) 11, 13, 22, 25, 33, 80, 243–5 Natural Language and Speech-to-Text 175 Netflix 39, 45, 120, 150, 181, 279 Netherlands 142 neural networks 16, 157, 158 accounting 40 bidirectional long short-term memory 22 convolutional 15 deep 19, 22, 78, 79 descriptive analytics 297 e-commerce 92 fashion retailing 78 feedforward 21, 23 financial prediction 12, 14 predictive analytics and decision-making 117 predictive analytics for machine learning and deep learning 152, 153, 154, 161 recurrent 15, 19, 23 see also artificial neural networks (ANN) New Zealand 168, 176
318 Handbook of big data research methods Nguyen, Q. 23 nib Group 168 Nichols, W. 52 Nielsen, S. 45 Nielson, J.B.B. 142 Nike 80 Nilsson, N.J. 150 nodes 157, 216–20, 223, 225, 244 Nokia Siemens Networks 120 Northeastern University 183 Notre-Dame fire and Twitter 233–63 biological disasters 233 crisis communication 246 disaster management 234–5 dissemination of information 233, 235, 261, 262, 269 hashtags 236, 239–40, 242–4, 247–8, 250–51, 255, 260–63 hashtags in tweet negatively associated with retweet time 240 hypothesis development 239–40 information timeliness 234, 236, 239–40, 243, 259–63 information transmission process 245–58 activity days after fire 249–53 activity during fire 245–9 activity three weeks after fire 253–8 hashtag network days after fire 251 hashtag network during response phase 248 hashtag network three weeks after fire 255 languages 251, 258 languages days after fire 252 languages during response phase 249 languages three weeks after fire 256 location of tweets days after fire 252 location of tweets during response phase 248 location of tweets three weeks after fire 256 most frequently used words during response phase 245, 246 most frequently used words three weeks after fire 254 tweet counts 247, 253, 257 literature review 234–8 man-made disasters 233–4, 236, 238, 241 natural disasters 233–4, 235, 236 perspectives 262–3 preparedness phase 244, 246 reconstruction phase 255 recovery phase 236, 242, 244, 249, 251–3, 255, 259–62 research methodology 240–45
data collection 242–3 description of Notre-Dame 240–42 empirical methodology 243 fire disaster 241–2 keywords and hashtags used to collect tweets 242 natural language processing and social network analysis 243–5 variable descriptions and summary statistics 243 response phase 236, 242, 244, 248–9, 251, 259–63 results 259–62 double-topic model 259 implications 261–2 regression results 260–61 topic modeling 259–60 retweets 233–4, 239–44, 257, 260–61, 263 tweets 233–4, 236, 239–57, 259–63 URLs 239–40, 261–3 word cloud 250, 254 words in tweet positively associated with retweet time 239 numerical analysis and values 90, 96 NumPy 171 OakNorth 167 Obaid, K. 15, 25 Obermeyer, Z. 151 object tagging 173 OCPUs 166 OCR skills 167 optimization algorithms 78, 296 Oracle 165, 172, 173, 202 Analytics 175–6 AutoML 166 Autonomous Data Warehousing 174–5 Business Intelligence 41 Cloud Data Science platform 166 Cloud Infrastructure 166, 175 OrangeHRM 202 overload of data 6 p-value 98–9, 113, 129, 131 PageRank 244 Panapakidis, I.P. 23 pandas 171 Papadimitriou, T. 22 Parallel Distributed Processing 153 parallel outcome trends 106 partial least squares (PLS) model 14, 15 pattern recognition 98, 297, 301 pay-per-use model 173, 177 Payne, A. 286 PayPal 119
Index 319 PDP Research Group 153 Peffers, K. 6, 215, 221–2 Pennsylvania University 161 people analytics 120, 180, 184, 186, 195–6, 198–200, 202, 207 People Inside 201 Pepperstone 167, 169–70 performance appraisal process 184–5 performance data 188 performance measurement see data-driven analytics (DDA), HRM and organization performance measurement performance-based task 129 performance-focused processes 36 personal data protection 187, 188, 267–76 EU GDPR 269–70 interconnection of personal data segments 273 Personal Data Protection law 271 Personal Identifiable Information (PII) 267 privacy 267, 268–9, 275–6 right to be forgotten/to be left alone 270, 274 taxonomy of privacy based on family resemblances 268–9, 272, 275 see also Complex Adaptive Systems (CAS) model analysis personal digital assistants (PDAs) 120 Petkov, R. 46 Pham, T.T.X. 18 Pickard, M.D. 44 Pitts, W. 153 Plakandaras, V. 15–16 Planned Behavior Theory 206 Plum 183 Poland 23 Polites, G. 222 Popovich, K. 286 Potočnik, P. 23 Power BI 41, 201 power law degree distribution 218, 223 Prassl, J. 184 pre-emptive analytics 4, 5 predictive analytics 2–5, 295–6, 297, 298, 299, 300 accounting 38–9, 41 for churn 74 CRM systems and AI 279 data science ecosystem and public cloud 169 data-driven analytics (DDA) and HRM 200 e-commerce 87, 89, 100 fashion retailing 74, 78, 79 financial prediction 13 marketing analytics 52, 53, 58 social network analysis using design science 214, 226
predictive analytics and decision-making 117–24 benefits 118–21 challenges 122–3 future implications 124 recommendations 124 predictive analytics for machine learning and deep learning 148–62 data science defined 149, 150 deep learning 152–60 machine learning 149–51, 154–60 use cases 160–62 prescriptive analytics 4–5, 38–9, 295–6, 297, 298 accounting 38–9 data-driven analytics (DDA) and HRM 200 e-commerce 87, 100 fashion retailing 78–9 social network analysis using design science 214 previous findings and context, review of 2–3 primary data 4 principal component analysis 14, 160, 169 PRISMA framework (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) 138 privacy and confidentiality 39, 68, 81–2, 176–7, 187, 291, 304 see also personal data protection probabilistic models 78–9 problem definition 2 framing or identification 3, 6, 221, 222 solving 2, 6, 37, 47, 176, 215, 221 processing data 6, 35, 124, 152, 172 proprietary tools 41 Pruitt, S. 15 psychometric functions (PFs) 128, 129, 130–31 psychophysical detection tasks (PDTs) 128 public cloud see data science ecosystem using public cloud Pukthuanthong, K. 15, 25 Python 42, 87, 172, 202 QlikView 80, 92, 301 qualitative approach 54, 68, 149, 177, 182, 202, 305 quality of data 6, 58, 61, 64, 90, 168 quantile regression forest 22 quantitative approach 36, 54, 68, 90, 149, 177, 305 R (programming language) 42, 87, 171, 202 radial basis function (RBF) 18, 134–5, 138 radio-frequency identification (RFID) 43 random forest 14, 18, 19, 22, 300 random walk 15, 16, 17, 21, 24
320 Handbook of big data research methods Rangaswamy, A. 2 Ransbotham, S. 1 Rao, C. 19 recommender system 158 regression 78, 79, 80, 87, 155, 158, 260–61 analysis 5, 72 descriptive analytics 297, 299, 301 linear 14, 18, 135, 155, 299 logistic 18, 19, 155–6 model 41, 74, 117 multiple 299 multiple predictive 14 ordinary least squares (OLS) 13–14, 26, 299 principal component 14 segmented 107 tasks 134 trees 14, 18 weighted least squared 14 regularization methods 13 Reliance Capital 223 Reliance Jio 94 Reserved Instances and Savings Plans 173 resource-based view (RBV) 181, 206 retail sector 119, 161 RFE (recursive feature elimination) 169 right to be forgotten/right to be left alone 270, 274 robots 25, 79, 81, 119 Rokach, L. 157–8 Rumelhart, D.E. 153–4 Rutgers Business School 43 Saboo, A.R. 5 safety analytics 304 Sage 43, 45, 47 Salehan, M. 3 Salesforce Automation (SFA) 281, 290 Samuel, A. 152–3 Sangle, P.S. 289 SAP 202 Success Factors People Analytics (SFPA) 184 Saratoga Institute 201 SAS Business Intelligence 41 Sassani, F. 305 Saura, J.R. 279 Scala 172 scalability 109, 151, 166, 174, 177 scatter plots 91 Schläfke, M. 44 Schlittenlacher, J. 142 Schüssler, R.A. 15 Science Direct 75 Scopus 75 scorecards 202, 205–6, 303
secondary data 4 security 171, 172–3, 174, 176–7 Sedol, L. 153 segmentation 53, 57–8, 65, 91, 100, 175–6 selection approach 19 selective sampling, stream-based 142 self-observation method 158 self-organisation 274 self-regulation, ethical 270 semi-structured data 4, 34, 73, 76, 86, 181, 295, 304 data science ecosystem using public cloud 167, 170, 172, 174 sensitivity analysis 74–5, 76, 82, 132 sentiment analysis 5, 161, 263, 299 sequence rules 91 Settles, B. 142 Shah, N.D. 161 Shannon, C.E. 142 Shanthikumar, G.J. 149 Shao, Z. 22 Sharma, A. 290 Sharma, R. 5–6 Shaw, J. 52 Shell 195 Shi, H. 152 Shipman, J. 176 Sigrist, F. 12, 18, 19 silhouette methods 300 silos 53, 170–71 Simon, H.A. 6, 215, 220 simulation modeling 78, 297 Singh, J.P. 5 Sirignano, J.A. 12, 25 ‘six degrees of separation’ 216–17 skewness 77 Slinnainmaa, S. 154 Slovenia 23 Smart Contracts 40–41 Smith, G.F. 222 social media 2, 4, 5, 25, 54–5, 73, 92, 181 e-commerce 88, 92, 98, 99 predictive analytics for machine learning and deep learning 151, 161 see also Notre-Dame fire and Twitter social network analysis (SNA) 92, 243–5, 263, 299 social network analysis (SNA) using design science 214–28 complex networks: properties and structure 216–20 complex networks: scale-free property 217–18 credit networks in Indian companies and banks (case study) 215, 223–6
Index 321 analysis and results 225–6 data collection 224–5 giant component 216 implications and limitations 226–7 methodology 220–23 background to DSR 220–21 process model (six-step) 221–3 modular structure 219–20 modules in unipartite and bipartite networks 220 small world effect 216–17 social networks 53, 223, 261 Solove, D. 268–9 Son, J. 236, 239 Song, X.D. 142 specificity 74–5, 132 speech analytics 92 speech recognition 33, 152 Spiess, J. 12 sports industry 161 spreadsheets 33, 301 SPSS 201 stacked denoising autoencoders 21, 22 standard deviation (SD) 77, 96, 98, 112, 132, 296, 301 Starbucks 181 State Bank of India 223 statistical analysis 36, 52 statistical goodness of fit 129 statistical inference 130 statistical models 86, 87, 117, 132, 148–9, 169, 201, 299 statistical significance 111 stepwise process 6 stimulus intensity 130 stochastic process 132 Stone, B. 233 storage of data accounting 35, 37 data science ecosystem using public cloud 172, 177 fashion retailing 73, 75, 77 predictive analytics and decision-making 122, 123, 124 predictive analytics for machine learning and deep learning 149, 152, 154 storytelling 5, 69, 91, 202, 205 strategic decision-making 99, 148, 181, 195 stress testing 175 structural equivalence 222 structure of data 168, 182 see also semi-structured data; structured data; unstructured data structured data 4, 33–4, 36, 44, 90, 181
data science ecosystem using public cloud 170, 172, 174 descriptive analytics 295, 304 fashion retailing 73, 76 structured relationship databases (SQL) 77 Suh, B. 240 supply chain management (SCM) 86 supply chain planning (SCP) 303–4 support vector machines (SVMs) 18, 19, 21, 22–3, 78–9, 156–7, 161, 300 support vector regressor (SVR) 16, 299 Switzerland 18 syncretic cost-sensitive random forest (SCSRF) 19 Szoplik, J. 23 T-Mobile 181 Tableau 41, 80, 87, 92, 93, 95, 202–3, 301 tables 296 Tambe, P. 181 Tamošiūniene, R. 289 Tan, F.T.C. 2 Tang, X. 19 Tanner, J.F. 279 Target 119 Tavakoli, M. 74 temporal data 105–7, 109, 111 Tesco 119 Tesla 152 text classification 15, 80, 88 text mining 5, 299 third parties 42, 188–9 threshold 128 logic 153 value 129 Tibco 87 Tierney, B. 149 time-series modeling 15, 72, 105, 151, 166 see also temporal data time-varying effects model 5, 12 Tobit models 18 training data 147, 155 Tranfield, D. 297 tree-based classifiers 18 treemaps 90 Troisi, O. 282 Trustev 2 Tukey, J.W. 152 Turkey 23 Twitter 4, 42, 89 see also Notre-Dame fire and Twitter Uber 42, 184 Unified Marketing Measurement (UMM) 59 unipartite networks 227
322 Handbook of big data research methods United Kingdom Information Commissioner 270 United States 15, 18, 22, 26, 123, 142, 185, 271 unstructured data 4, 13, 33–4, 36, 117 data science ecosystem using public cloud 167, 170, 172, 174 descriptive analytics 295, 300, 304 e-commerce 86, 89, 90, 92, 95 fashion retailing 73, 76 HR analytics 181, 182 USC 183 user-level attribution data 65 Vaishya, R. 151 validity 12, 107, 113, 188 value 4, 12, 34, 72, 181, 214, 295 Van den Heuvel, S. 196 Varetto, F. 12 variety 6, 11–12, 34, 72, 91, 181, 214, 295 Vasarhelyi, M. 44 VECM model 22 vector autoregressive models 15 vector error correction 22 Velcu, O. 202–3 velocity 6, 11–12, 34, 72, 181, 214, 295 veracity 12, 34, 214, 295 Viollet-le-Duc, E. 241 Virgin Atlantic 223 visualization 33, 41, 42, 149, 168 data science ecosystem using public cloud 168 data-driven analytics (DDA) and HRM 196, 202–3, 206 descriptive analytics 301–4 descriptive analytics and data visualization in e-commerce 92 fashion retailing 80 social network analysis using design science 214, 216 see also descriptive analytics and data visualization in e-commerce; storytelling voice recognition 98, 152 volatility 12 volume 6, 11–12, 34, 72, 181, 214, 295 Vrontos, S. 15
Waller, M.A. 148 Walmart 45, 86, 94, 151 Wang, F. 12 Wang, L. 22 warehousing 6, 77, 86, 149, 172, 174, 289 Watson, R. 222 wearables 120 weather forecasting 162 web analytics 53 Weber-Fechner law 128 website morphing algorithms 76 Wei, Y. 74 whisker plots 168 Williams, E. 233 Wittgenstein, L. 268–9 Wong, E. 74 Workday 202 workforce analytics 198, 200 XGBoost 18, 22 Xiang, Z. 3 Xin, Y. 161 Xu, J. 304 Yacchirema, D.C. 304 Yahoo 34 Yang, Q. 141 Yao, X. 19 Yes Bank 223 Yigitbasioglu, O.M. 202–3 YouTube 151, 158 Yu, L. 18, 21, 22 Zechner, J. 15, 25 Zhang, J. 34 Zhang, W. 162 Zhang, X. 15 Zhao, Y. 21 Zhong, R.Y. 5 Zhou, Z.H. 150 Zoho Analytics 41 Zoho People 201 Zoom 290