From Big Data to Intelligent Data: An Applied Perspective (ISBN 3030769895, 9783030769895)

This book addresses many of the gaps in how industry and academia are currently tackling problems associated with big data.


English Pages 122 [121] Year 2021



Table of contents :
Preface
Acknowledgments
Contents
About the Author
1: Introduction
1.1 The Business Value Proposition
1.2 The Enhanced Value
1.3 The Age of Big Data
1.4 From Business Intelligence to Business Analytics
1.5 Dynamic Process Flow
1.6 A Paradigm Shift
1.7 Evolving Technologies
1.8 Data Modeling: Structured or Unstructured
1.9 Much Information Little Intelligence
1.10 Measuring Information: Bits and Bytes
1.11 The Competing Vs of Big Data
1.12 The Competitive Edge
2: High Fidelity Data
2.1 The Telephone Game: Data Sourcing and Transmission
2.2 From Audiophile to Dataphile
2.3 Interference and Data Contamination (Signal-to-Noise)
2.4 Monitoring, Detecting, Resolving, and Reporting
Monitoring
Detecting
Resolving
Reporting
3: Connecting the Dots
3.1 The Internet of Things (IoT)
3.2 Data Aggregation
3.3 The Golden Copy
4: Real-Time Analytics
4.1 Faster Processing
4.2 Analytics on the Run
4.3 Streaming Data
5: Predicting the Future
5.1 A Crystal Ball
5.2 Machine Learning and Artificial Intelligence
5.3 Smart Reporting and Actionable Insights
Data Context
Units, Scales, Legends, Labels, Titles, and References
Data Presentation
5.4 Codeless Coding and Visual Modeling
6: The New Company
6.1 The Mythical Profile
6.2 Organizational Structure
6.3 Software and Technology
7: Data Ethics: Facts and Fiction
7.1 Virtual or Fake Reality
7.2 Privacy Matters
7.3 Data Governance and Audit
7.4 Who Owns the Data
7.5 The Coming of COVID-19
8: Role of Academia, Industry, and Research
8.1 Revamping Academia
8.2 Bridging the Gap
8.3 STEAM for All
8.4 A Capstone Template



Management for Professionals

Fady A. Harfoush

From Big Data to Intelligent Data: An Applied Perspective

Management for Professionals

The Springer series Management for Professionals comprises high-level business and management books for executives. The authors are experienced business professionals and renowned professors who combine scientific background, best practice, and entrepreneurial vision to provide powerful insights into how to achieve business excellence.

More information about this series at http://www.springer.com/series/10101

Fady A. Harfoush

From Big Data to Intelligent Data: An Applied Perspective

Fady A. Harfoush
CME Business Analytics Lab
Loyola University Chicago
Chicago, IL, USA

ISSN 2192-8096    ISSN 2192-810X (electronic)
Management for Professionals
ISBN 978-3-030-76989-5    ISBN 978-3-030-76990-1 (eBook)
https://doi.org/10.1007/978-3-030-76990-1

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2021

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG. The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Preface

It is hard nowadays to come across a job description or a job title that does not include a reference to data science, data engineering, big data, business analytics, artificial intelligence, or machine learning. Academic programs and certificates are no exception. Are these job titles, academic programs, and certificates purposely making a distinction when selecting a title or naming a program? What is the difference between a data scientist and a data engineer? Is business analytics different from business intelligence?

After receiving my PhD in Electrical Engineering and Computer Sciences (EECS, combined at the time) from Northwestern University, my first job was at Fermi National Accelerator Lab (FNAL). The offer came as a bit of a surprise. FNAL is among the most prestigious national labs for high-energy particle physics. What do engineers like me have to offer? It turned out a lot. I could not have wished for a better place to start my career. The opportunities to explore, the capabilities available, the interactions, the international collaborations: all the dynamics were as interesting as they were intellectually fulfilling. There were endless entertaining debates among physicists and engineers on their roles and their added value. I remember an anecdote from an interview with Leon Lederman, who received the Nobel prize in physics in 1988 and was then Director Emeritus of Fermilab, when he was asked, "What is the difference between a physicist and an engineer?" His answer was very telling and summarized it all: "A physicist conducts ten experiments, nine fail and one succeeds, and he/she gets the Nobel prize. An engineer builds ten bridges, nine succeed and one fails, and he/she gets sued." In short, physicists, scientists, engineers, and researchers, whatever the titles, all have a role to fulfill. Many times we get distracted by the titles and the labels, and we forget about the actual roles to be assumed and the tasks to be accomplished.

The idea of this book came after recognizing a big gap in how industry and academia tackle the problem of big data. Many references are old statistics books re-published with a new title and slightly modified or added content to fit new trends in the industry. They lack the much-needed practical industry insights. Among my first tasks when I joined academia was to develop a new course in business analytics. I struggled to identify a good resource with an applied industry perspective, one that addresses the most pertinent issues, challenges, and opportunities surrounding big data which I believe are relevant and essential.


By now many have memorized the four Vs of big data (Volume, Velocity, Veracity, and Variety), but how do the Vs interact and compete? And what about the most important V (more on that later in the book)? We are overwhelmed with the rapid advances in technology and the many tools to choose from, such as R, Python, SAS, SQL, NoSQL, Hadoop, Scala, Spark, MapR, Databricks, etc. This is jargon best suited for programmers, systems developers, infrastructure engineers, and technologists. Most associate the topic of big data with intelligent information and actionable insights. But what exactly is intelligent information? How do we define it, measure it, and assess its business value? What is the difference between business intelligence and business analytics? Big data is not new, and many (like me) who worked at national research labs have been dealing with big data for years. What most books fail to address are some of the most compelling challenges and rewards of big data: connecting the dots, building a unified system of systems, and deriving the so-called golden copy.

This book addresses many of these gaps, introduces some novel concepts, describes the end-to-end process, and connects the various parts of the puzzle for a holistic view. Visuals are used to explain complex ideas with relatively simple diagrams. The book is intended to be a practical guide and a good companion for many in the industry working with large data sets, and for those supervising or participating in projects involving big data (irrespective of the job titles!). It is tailored for generalists who want to see the big picture, understand the process, ask the right questions, and be cognizant of the many pitfalls and the many rewards. It is not for specialists interested in enhancing their skills in a particular area like programming or database design. Important takeaways are highlighted in grayed boxes. The book purposely avoids technical jargon, abbreviations, and mathematical formulas. There is only one equation in the book! The aim is to explain important concepts to a wide audience using language, diagrams, examples, and analogies that are easy to follow and to grasp. Many examples stem from my own experience working in the industry. It is a good reference mainly for B-schools with a program intended for professionals, such as an EMBA/MBA, a specialized Master's program, or a certificate in the field. The book can also be leveraged for experiential learning with applied lab work involving capstone or practicum projects. The last chapter discusses the role of academia, industry, and research, and ways to bridge the gap. The book ends with a walk-through, a sort of blueprint for a capstone project or practicum best fitting a program in business analytics.

Chicago, IL, USA

Fady A. Harfoush

Acknowledgments

I want to thank my long-time work colleague and friend Jean-Marc Reynaud for the many inspiring intellectual conversations, and the food for thought sessions. Some of the concepts discussed in this book are the results of these interactions. Special thanks to my wife Leyla for her thorough edits and suggestions, and to my children Ryan, Danny, and Kyle who never stopped believing in me.


Contents

1 Introduction
1.1 The Business Value Proposition
1.2 The Enhanced Value
1.3 The Age of Big Data
1.4 From Business Intelligence to Business Analytics
1.5 Dynamic Process Flow
1.6 A Paradigm Shift
1.7 Evolving Technologies
1.8 Data Modeling: Structured or Unstructured
1.9 Much Information Little Intelligence
1.10 Measuring Information: Bits and Bytes
1.11 The Competing Vs of Big Data
1.12 The Competitive Edge
2 High Fidelity Data
2.1 The Telephone Game: Data Sourcing and Transmission
2.2 From Audiophile to Dataphile
2.3 Interference and Data Contamination (Signal-to-Noise)
2.4 Monitoring, Detecting, Resolving, and Reporting
3 Connecting the Dots
3.1 The Internet of Things (IoT)
3.2 Data Aggregation
3.3 The Golden Copy
4 Real-Time Analytics
4.1 Faster Processing
4.2 Analytics on the Run
4.3 Streaming Data
5 Predicting the Future
5.1 A Crystal Ball
5.2 Machine Learning and Artificial Intelligence
5.3 Smart Reporting and Actionable Insights
5.4 Codeless Coding and Visual Modeling
6 The New Company
6.1 The Mythical Profile
6.2 Organizational Structure
6.3 Software and Technology
7 Data Ethics: Facts and Fiction
7.1 Virtual or Fake Reality
7.2 Privacy Matters
7.3 Data Governance and Audit
7.4 Who Owns the Data
7.5 The Coming of COVID-19
8 Role of Academia, Industry, and Research
8.1 Revamping Academia
8.2 Bridging the Gap
8.3 STEAM for All
8.4 A Capstone Template

About the Author

Fady A. Harfoush completed his high school education at the Lycée Franco-Libanais, part of the Mission Laïque Française, in Beirut, Lebanon. He earned his undergraduate degree in Electrical Engineering at Bogazici University in Istanbul, Turkey. He pursued his advanced graduate studies at Northwestern University in Illinois, USA, where he received his PhD in Electrical Engineering and Computer Science (EECS) with a minor in Applied Mathematics. His work experience is diverse, starting as a research engineer at Fermi National Accelerator Lab (FNAL) and an invited scientist at the Conseil Européen pour la Recherche Nucléaire (CERN). His early publications and research work are well documented and still cited to this day. After 6 years in research, he transitioned to work at leading financial firms, holding different positions as applied quant, financial engineer, technology manager, and director. He is the co-founder of Social Market Analytics (SMA), a leader in harnessing the massive amount of unstructured social and financial data. He is the holder of three patents in the field, with more pending. In his most recent role in academia, he is the inaugural director of a business analytics lab, and an executive lecturer. Fady speaks four languages fluently. He enjoys exploring the world and different cultures. His favorite retreat place is somewhere along the Aegean Sea.


1 Introduction

1.1 The Business Value Proposition

Twitter generates an average of 500 million tweets per day, and this number is increasing by the day. Getting access to Twitter's full daily tweets (the firehose) is costly, and limited to those who really need it and can afford it. It is estimated that access to the firehose costs somewhere in the few hundred thousands of dollars per year. Is this a reasonable price to pay? What do we get in return? Is this a good business value proposition for Twitter? Do the benefits or business rewards outweigh the costs? What is the value added for a business to access the full firehose of tweets if the company does not have the infrastructure and the know-how to capture and analyze the tweets, extract the valuable insights, and do it in almost real time? Not a small undertaking. Before rushing and jumping on the big data bandwagon, these are some of the questions every business should address first, both from the technological and from the business perspectives.

It is safe to assume that only 1% of all daily tweets contain valuable or intelligent information that can be translated into actionable insights. What is the added business value of access to more than 500 million tweets per day, and what is a good price? If 99% of big data is dirty data, which remaining 1% is good data? Finding the 1% is like searching for a needle in a haystack. How is a business supposed to assess the price of a subscription to the tweets, and the return on the investment, if one cannot easily distinguish between a valuable or true tweet and a junk or fake tweet? Should a consumer be charged for the 1% only, and which 1%? Who decides the right percentage?

We should be charged for the quality of service we receive. That is not the case with data. In almost all cases the data provider or vendor will have a disclaimer of the sort "XYZ assumes no responsibility for errors or omissions. The user assumes the entire risk associated with its use of these data." A good comparison, albeit in a different context, is purchasing fruit from a grocery store. The price is set by weight, and not by how much juice can be extracted from the fruit. In principle the price should be set by the amount of juice content. This may sound like a far-fetched scenario, but thanks to evolving sensing technologies it is possible that in the foreseeable future we could be charged by the juice content, and not by the weight.

Clearly a business model based on providing access to the tweets' firehose is not a good business value proposition for everyone. The real value proposition resides in the analytics and actionable items that can be extracted from the tweets. For these reasons, Twitter has partnered with a few in the industry, among which is my co-founded SMA (www.socialmarketanalytics.com), to help businesses in different sectors access the analytics derived from the tweets.
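To make the subscription question concrete, here is a rough back-of-the-envelope calculation in Python. The $300,000 annual cost is only a placeholder for the "few hundred thousand dollars" figure mentioned above, and the 1% useful share is the rough assumption from the text, not a measured number.

# Back-of-envelope estimate of what the firehose costs per potentially useful tweet.
# The annual cost and the 1% useful-tweet share are illustrative assumptions only.

ANNUAL_COST_USD = 300_000        # assumed subscription cost ("few hundred thousand dollars")
TWEETS_PER_DAY = 500_000_000     # average daily tweet volume cited above
USEFUL_SHARE = 0.01              # assumed share of tweets with actionable content

useful_tweets_per_year = TWEETS_PER_DAY * 365 * USEFUL_SHARE
cost_per_useful_tweet = ANNUAL_COST_USD / useful_tweets_per_year

print(f"Useful tweets per year: {useful_tweets_per_year:,.0f}")
print(f"Cost per potentially useful tweet: ${cost_per_useful_tweet:.6f}")
# Roughly 1.8 billion potentially useful tweets per year, a fraction of a cent each.

Even at a fraction of a cent per potentially useful tweet, the calculation says nothing about the harder problem: identifying which tweets make up that 1%.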

1.2 The Enhanced Value

A good introduction to the topic is a scene from the movie "The Circle" (2017). In the scene the lead actor, played by Tom Hanks, presents the company's new product (a cheap tiny camera with real-time broadcasting capabilities) to the employees and the new recruits. During the presentation, he delivers two quotes critical to creating the enhanced value. The first is about linking the information from different sources to create a unified view by which an enhanced level of intelligent information is obtained. This ties very well, as we will see in later chapters, with the topic of the IoT (Internet of Things).

"Knowing is good, but knowing everything is better"

The second emphasizes the need to process the data and run the analytics in real time.

"Real Time Analytics Process"

Both quotes have major implications we will discuss throughout this book. They represent the biggest challenges and the most compelling competitive edge: the ability to link the data from the different sources to create a unified view, and to run the analytics in almost real time.

Fig. 1.1 The business value proposition

The world has evolved from having limited and controlled information to having unlimited and open information. Interestingly, both outcomes are equivalent considering that most of big data (99%) is dirty data. In many applications (e.g., engineering) the signal-to-noise ratio (SNR) is a good metric to measure performance and assess quality. Simply put, we want to enhance the signal (the numerator) and reduce the noise (the denominator) to achieve a high SNR. The SNR can be viewed as a representative measure of good-to-bad data, with the signal representing the good data and the noise the bad data. The two extreme cases, very limited (close to zero) good data and very large, unlimited (close to infinity) bad data, both lead to an SNR close to zero, with little or no real business value. Without the proper tools and the know-how to separate the good data from the bad data, having very limited (close to zero) data or unlimited (close to infinity) data provides little or no intelligent information.

What industry needs are the tools and the know-how to mine the unlimited information, connect the dots, and do it almost instantaneously. Those able to do so will gain the competitive edge, maintain industry superiority, and eventually create a monopoly. A familiar example is Google. A less publicly known company called Palantir (www.palantir.com) has for years specialized in creating a unified view leveraging data from different data sources. It is used mainly by government agencies. The company went public in September 2020. Throughout this book we will address the fundamental question described in Fig. 1.1: how to turn big data into intelligent data, extract the actionable insights, and do it fast. Information here is used in general terms: to convey data, share an opinion, or describe an event or an observation. It is not a statement of intelligence. We will explain the distinction later.
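As a minimal sketch of the SNR idea applied to data, the snippet below treats SNR as a simple ratio of good to bad records. The is_good() rule is a hypothetical stand-in for whatever quality test a real pipeline would apply (schema checks, spam filters, source reputation, and so on).

def is_good(record):
    # Hypothetical quality rule: a record must have a non-empty payload and a known source.
    return bool(record.get("payload")) and record.get("source") in {"sensor", "api"}

def snr(records):
    # Signal-to-noise as a ratio of good records to bad records.
    good = sum(1 for r in records if is_good(r))
    bad = len(records) - good
    return float("inf") if bad == 0 else good / bad

batch = [
    {"payload": "temp=21.5", "source": "sensor"},
    {"payload": "", "source": "sensor"},          # noise: empty payload
    {"payload": "buy now!!!", "source": "spam"},  # noise: untrusted source
    {"payload": "temp=22.1", "source": "api"},
]
print(f"SNR of this batch: {snr(batch):.2f}")     # 2 good / 2 bad -> 1.00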


Information does not necessarily imply intelligent information!

Questions raised earlier about Twitter's business value proposition apply to other social media channels such as Facebook, Instagram, and LinkedIn. What is their business value proposition? Is it the platform or is it the data? The data represents a major part of these businesses and their valuations. Data is a dirty business, and content is king. But how do we monetize data? What data and what content are we competing and paying for?

1.3 The Age of Big Data

Welcome to the age of big data. Big data has been described as the new oil, the new gold rush, and at times compared to the advent of the Internet revolution. While to some extent these analogies are correct, there is a level of exaggeration, coupled with a lack of historical context, and a hype associated with the rush to capture the market opportunities. To begin, we need to set the record straight on what big data is and is not, and put matters in the right perspective.

Many private and national research labs have for years been working with what we now call big data. Drawing from my own experience, examples can be cited from research work in high-energy physics at places such as Fermi National Accelerator Lab (FNAL) in Batavia, Illinois, and the European Organization for Nuclear Research (CERN), located on the border between Switzerland and France. Experiments conducted at these particle physics accelerator labs have the mission to search for new particles and confirm (or reject) the integrity of established theories (e.g., the Standard Model in physics). The amount of data collected from experiments conducted at these labs easily ranges in the hundreds of petabytes (a petabyte is 2^50 bytes) per year. It can sometimes take more than a year to analyze the data using high-performance computing servers to detect any traces of a new particle. It is like looking for a needle in a haystack. Many other national labs in the USA (Sandia, Lawrence Livermore, Argonne, Jet Propulsion Lab, etc.) and globally (CERN) have also for years been working with big data. Similar experiments, large data collection, and analysis are conducted in astronomy in the study of the cosmos, and in searching for signs of extraterrestrial life.

In non-government private industries, Walmart, for example, has been in the business of collecting and analyzing big data for years, looking at transactional purchases. The financial industry has been working with big data, analyzing years of historical stock "tick" data collected sometimes at microsecond time intervals. One can quickly appreciate the large amount of data collected. In summary, it is safe to say that big data is not a new phenomenon and has been around for many years.


Fig. 1.2 Big data sources

So, what is new? Like many knowledge and technology transfer between research labs and industry, what makes big data new is its democratization and its commercialization. Its wide adoption has been facilitated by the rapid advances in technology making it cheaper and easier to generate, collect, and analyze data. Big data is not a new phenomenon. What is new is the democratization, the commercialization, and the wide adoption of big data. As depicted in Fig. 1.2 data is now generated by many sources like social networks, mobile devices, and smart sensor technologies used in IoT (Internet of Things). Quite often the data collected is made available freely for everyone to view and to analyze. The real value resides not in the data itself, but in the intelligent information extracted from the data. According to the World Economic Forum (WEF) report published in April 2019 (wef-2019), it is estimated that by 2025 the amount of data produced per day globally will surpass the 400 exabytes. The entire universe is expected to reach 44 zettabytes in 2020. Facebook generates about 4 petabytes of data each day including images and videos. To put matters in perspective, a gigabyte (1 Gb) is equivalent to one thousand average quality images (1 Mb/image), or 200 higher quality images (5 Mb/image). A petabyte is 1 Gb followed by 6 zeros, and a zettabyte is 1 Gb followed by 12 zeros! The numbers are overwhelming. In the

6

1 Introduction

foreseeable future, we will be talking of yottabytes. In summary there is no shortage of data available. There is no shortage of data. The questions for every business and for every consumer are what to do with so much data, how to evaluate, and how to leverage.
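A quick sanity check of the scale arithmetic quoted above, using decimal prefixes (1 GB = 10^9 bytes) and the image sizes assumed in the text:

GB = 10**9
PB = 10**15   # petabyte: 1 GB followed by 6 zeros
ZB = 10**21   # zettabyte: 1 GB followed by 12 zeros

avg_image = 1 * 10**6   # 1 MB average-quality image
hq_image = 5 * 10**6    # 5 MB higher-quality image

print(GB // avg_image)      # 1000 average-quality images per gigabyte
print(GB // hq_image)       # 200 higher-quality images per gigabyte
print(4 * PB // avg_image)  # Facebook's ~4 PB/day expressed in average-quality images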

1.4 From Business Intelligence to Business Analytics

With the commercialization of big data come new challenges and new opportunities. In the new digital age, the economics separating a producer from a consumer are more blurred. The customer is not just a consumer of a product anymore, but also a producer. As an old saying goes, "if you are not paying for it, then you are the product." The product is the data generated by the consumer. This entangled relationship raises interesting questions and presents new economics affecting our traditional understanding of the marketing sector and the interaction between consumers and producers.

To understand the commercial impact of big data, we need to introduce the concept of business analytics. Data being the new commodity, what value or benefits can we expect in exchange for the data we produce and consume? Many of the benefits come not in terms of monetary reward, but in better customized services tailored to suit our needs and tastes. In defining business analytics there are three main contributing components, represented by the circles in Fig. 1.3: business, analytics, and intelligent information. Every business sector is affected by big data. No sector is immune. The analytics spectrum is wide, from the simple descriptive to the more complex prescriptive. The third and most important component is intelligent information. Irrespective of the business sector, no matter how fancy and advanced the analytics are, unless we can extract the actionable insights, identify the value added to a business, and show how to gain the business competitive edge, we have little to offer. As noted in Fig. 1.3, what distinguishes business analytics is the streaming, real-time nature of big data. An important characteristic of big data is its 24/7 real-time streaming nature. Data never sleeps! Throughout this book, we will examine how streaming data is affecting the ways many problems are tackled.

Unlike Business Analytics (BA), Business Intelligence (BI) has been around for a long time. Many businesses still depend on internal built-in legacy systems and established rules and processes for their internal business intelligence reporting. So, what is the difference between BA and BI? This is best illustrated in Fig. 1.4.


Fig. 1.3 Business analytics

Fig. 1.4 Business Intelligence (BI) versus Business Analytics (BA)

BI tends to be backward looking, with a focus on historical data. It is more static, passive, descriptive, reactive, and explanatory in nature. A typical example is a retail store looking at past inventories and sales. By studying historical data over the last six-month period, the retail store can decide what products and merchandise to have available at different store locations over the coming period. Studies are usually conducted using approved methods and tools to ensure conformity within and across the company's different locations and branches. Results are presented in the form of well-structured BI reports. Such work is more reactive in nature, and the statistics tend to be of the descriptive type. Reports attempt to answer the questions of what happened, where, and when.

Business analytics, on the other hand, is more forward looking, derived from new, constantly changing, and streaming data. It is more dynamic, adaptive, proactive, and predictive in nature. It goes beyond what, where, and when, to ask the questions of why, how, and what next. The analytics are more granular, at the individual consumer level as opposed to a group or general crowd level. Online stores such as Amazon and Netflix look at a particular consumer's purchase habits or web browsing, and accordingly suggest the next potential product to view or buy. Business intelligence analysis might, for example, ask how many borrowers missed their last month's payment, whereas business analytics is interested in the likelihood that a borrower will miss their next month's payment.

Business Intelligence (BI) is past looking, more passive, reactive, and static in nature. Business Analytics (BA) is forward looking, more proactive, predictive, and dynamic in nature.
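To make the contrast concrete, the sketch below asks both questions of the same made-up loan data: the BI question is a descriptive count of what already happened, while the BA question is a per-borrower probability of what may happen next. The features and the tiny logistic regression are purely illustrative, not the approach of any particular lender.

# Illustrative only: the borrower data and features are invented to contrast BI and BA.
import pandas as pd
from sklearn.linear_model import LogisticRegression

loans = pd.DataFrame({
    "borrower":          ["A", "B", "C", "D", "E", "F"],
    "missed_last_month": [0,   1,   0,   1,   0,   0],    # what happened (BI)
    "utilization":       [0.2, 0.9, 0.4, 0.8, 0.1, 0.7],  # hypothetical features
    "late_fees_ytd":     [0,   3,   1,   4,   0,   2],
})

# Business intelligence: a descriptive, backward-looking aggregate.
print("Borrowers who missed last month:", int(loans["missed_last_month"].sum()))

# Business analytics: a forward-looking estimate for each individual borrower.
features = loans[["utilization", "late_fees_ytd"]]
model = LogisticRegression().fit(features, loans["missed_last_month"])
loans["p_miss_next_month"] = model.predict_proba(features)[:, 1]
print(loans[["borrower", "p_miss_next_month"]].round(2))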

1.5 Dynamic Process Flow

Over the years, many in the industry have invested heavily in developing their own internal frameworks and processes to generate custom BI reports. A typical BI process flow follows a standard framework known as CRISP-DM (Cross Industry Standard Process for Data Mining), first introduced in 1996. A representative diagram is shown in Fig. 1.5. Looking at the CRISP-DM diagram, what is important to note are some of the changes that have occurred since the advent of big data. The main six phases identified in the diagram still apply. Of interest are the reversible arrows between some of the phases, in particular the long arrow that goes from the evaluation phase (sometimes referred to as prod-staging or beta) back to the early phase of business understanding.
• The implications of reverting from the evaluation phase to the business understanding phase can be very costly, considering the efforts (time, human resources, and budget) already invested in the intermediate phases.
• The CRISP-DM diagram assumes the data resides in a traditional (relational) database, in the sense of storing, extracting, and querying the data. This can be a centralized database as part of an enterprise solution, or a distributed data server.


Fig. 1.5 Typical BI process (CRISP-DM) (https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining)

The streaming nature of big data requires the adoption of new tools and more agile development techniques. The data is not necessarily stored in a relational database and can instead be consumed directly from streaming external sources. The dynamic nature of the data requires more iterative loops, frequently executed between phases, to adapt, test, and assess the integrity of each phase. Methods of intervention to correct for errors should equally allow for multiple iterations and self-corrections. Even prior to the advent of big data, many concepts in agile development and testing had been adopted to ensure a level of integrity and quality early in the process, prior to the final evaluation or pre-production (prod-staging) integrated testing. Agile development helps identify problems at an early stage and avoids reverting from the late evaluation phase to the early business understanding phase as depicted in the CRISP-DM model. Such a reversal can be very costly for a business.

Fig. 1.6 Cyclical process flow (https://www.ibm.com/blogs/think/2018/02/data-science-design/)

A good example of a process flow better suited to streaming big data and business analytics is a dynamic cyclical flow, as represented in Fig. 1.6. Of note:


• The model in the process flow is no longer a finished product. Instead, we have a dynamic model that constantly adapts to changing conditions as more and new data are collected.
• The cyclical loops allow for the early detection and correction of problems.
• The data is not necessarily residing in a centralized database, suggesting a dynamic cyclical loop of business questions, revision of the business understanding, and more frequent deployment.

The traditional CRISP-DM process flow for data mining is more suited for BI projects. A dynamic, cyclical process where a model is never finished, a database never complete, and deployment more frequent is best suited for BA projects.

Migrating from a BI-oriented solution to a BA-type solution poses many new challenges, especially for organizations with established internal procedures and legacy solutions. To begin, it requires fundamental changes to how enterprises operate: work culture, resourcing, vision, procedures, and hierarchical organization. Questions of what models to use, when to modify, how to monitor, how to govern, and how to audit require the adoption of new methods. A cyclical process flow with many iterations requires close coordination and frequent interactions among multiple teams. Such coordination becomes challenging when teams are working remotely, scattered across geographical regions, and in different time zones.

To demonstrate the significance and impact of migrating from BI to a more BA-fitting solution, let us consider one scenario in relation to auditing. Auditing, in simple terms, is the ability to trace or walk back in time to recreate a particular outcome, scenario, or result that occurred in the past. The most important factors in any audit are the data and the models used. If the data is constantly changing and the model is constantly learning and adapting, how do we conduct a data and model audit? Auditing with both the data and the models changing is not an easy task anymore. It requires capturing and re-creating both the data and the model at each moment or time step in the past. Is it possible? How far back in time do we need to go?

Before any enterprise rushes into adopting new technologies and embracing new methods of work, it must first assess the costs and rewards and the business benefits, and promote a culture and foster an environment best suited to business analytics. Otherwise, there can be a detrimental effect on the company's success and market competitiveness in the long run. Although many published surveys show a significant rise in the percentage of companies who want to leverage big data and new technology, many of those surveyed also expressed concern about how to go about implementing the necessary changes.
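What "capturing and re-creating both the data and the model at each time step" might look like in practice is sketched below, as one possible approach rather than a prescription: every deployment writes a timestamped snapshot of the model together with the exact data it was trained on, so a past result can later be reproduced. The directory layout and the pickle/JSON formats are assumptions made for illustration.

# A minimal audit-snapshot sketch: persist model and training data together, keyed by time.
import json
import pickle
from datetime import datetime, timezone
from pathlib import Path

AUDIT_DIR = Path("audit_snapshots")   # hypothetical location for snapshots

def snapshot(model, training_batch, notes=""):
    # Persist the model and the exact data it was trained on, keyed by a UTC timestamp.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    folder = AUDIT_DIR / stamp
    folder.mkdir(parents=True, exist_ok=True)
    (folder / "model.pkl").write_bytes(pickle.dumps(model))
    (folder / "data.json").write_text(json.dumps(training_batch))
    (folder / "meta.json").write_text(json.dumps({"notes": notes, "stamp": stamp}))
    return stamp

def restore(stamp):
    # Re-create the model/data pair as they existed at a given point in time.
    folder = AUDIT_DIR / stamp
    model = pickle.loads((folder / "model.pkl").read_bytes())
    data = json.loads((folder / "data.json").read_text())
    return model, data

The open questions raised above remain: how frequently to snapshot, how far back to retain snapshots, and how to afford the storage this implies.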


There is no one solution that fits all. Many times, small incremental improvements can provide rewarding results. It is important to understand the costs and benefits, and to set early expectations with clear milestones to measure success, before signing on to big BA projects.

1.6 A Paradigm Shift

The new paradigm shift in how we think and how we work is pushing the emergence of new technology for the handling and rapid processing of streaming big data. The analytics need to equally learn, adapt, and work quickly on fast-moving, large amounts of data to extract the value, bypassing in many cases the need to store the data in traditional relational databases fitted with a structured query language (SQL). These can be challenging tasks, and in some limiting cases (as we will discuss later) unattainable. To highlight the significance of the new paradigm change, let us consider an analogy, albeit an exaggerated one, to illustrate the point. A patient requires urgent hospital care. Under current conditions the patient is carried in an ambulance to the nearest hospital, where surgery is performed. In the new paradigm there is no time to go to the hospital. Instead, the surgery must be performed in the ambulance, while travelling at high speed on unpredictable, sometimes bumpy or curving roads, as illustrated in Fig. 1.7.

Fig. 1.7 Paradigm shift


The new paradigm shift and the 24/7 streaming data require new methods, new ways of thinking, and new tools for problem solving. It is like performing a surgical operation on a patient who is conscious, and travelling at a high speed on a bumpy and curving road.

1.7 Evolving Technologies

In response to the paradigm shift, new tools and methods of intervention have been and are being developed. A relational database comes with a structured query language (SQL) to allow for the storing, extraction, and analysis of the data in the database. It is a solution best fitting, in our analogy, the surgical room in a hospital. However, relational databases were not designed to work with streaming big data arriving at high speed and in different formats mixing text, numbers, images, audio, and video (referred to as unstructured data). A relational database does not scale and adjust well to changes in data format, speed, and volume. This has led to the emergence of new technological solutions to address such limitations. The NoSQL ("not only SQL") type of database is one example. Solutions like Hadoop, Scala, and Databricks provide more adaptable ways to address questions of dynamic storage and fast processing of streaming big data sets composed mainly of unstructured data (text, video, audio, images).

Another example of a solution impacted by streaming big data is the widely used data processing approach known as ETL (Extract, Transform, and Load). By its nature, ETL introduces undesired overhead when processing the data to allow for the three stages of data extraction, transformation, and loading. ETL streamlines the process and works well when the database and the format and variety of incoming data are well defined and not frequently changing, where timeliness is less immediate, and where accuracy of the data can be managed. Rules for monitoring, for evaluating, and for intervening to correct a problem can be implemented a priori, like a manufacturing assembly line where the inputs and outputs are somewhat predefined. With unlimited streaming data in varied formats, combined with the need to conduct near real-time analytics (presented earlier), many of the old solutions and methods of operation must be revamped. The inputs and outputs are less predefined.

NoSQL databases like MongoDB or Cassandra bring scalability and performance, and allow you to quickly dump the data into the database (using simple key/value-based tables). This way the data is readily available to view and to analyze. It does, however, come at a cost. Any speed-up in the loading of data usually means slower data retrieval. It is a careful balancing act. This is partially due to the large size of the (denormalized) tables, and the lack of a database design tailored to how the data will be extracted. It is often a zero-sum game. The penalty in performance is either paid at the back end when uploading the data, or at the front end when extracting or downloading the data.

Special-purpose hardware-based solutions have been developed for both fast back-end data loading and fast front-end data retrieval. Examples are the use of SSD (Solid State Drive) technology optimized for both fast loading and retrieval of data, in-memory solutions, database appliances (Netezza), MPP (Massively Parallel Processing) architectures such as Greenplum, and column-oriented databases (best for transactional time series data) such as Vertica and Sybase IQ. Other architectural solutions, such as SAP HANA and Teradata, have combined database speed and analytics. Most of the new emerging technological solutions stem from the need to accommodate the growing unstructured type of data, and the fast processing of data. This is an area which will continue to grow and witness new solutions as the need for speed, storage, and rapid analytics evolves.
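As a minimal sketch of the key/value style of storage described above, the snippet below uses MongoDB through the pymongo driver. It assumes a MongoDB server is running locally, and the database and collection names are made up for illustration.

# Assumes a local MongoDB instance; database/collection names are illustrative only.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
tweets = client["demo_db"]["tweets"]

# "Dumping" data is easy: no schema to design up front, each document is a key/value bag.
tweets.insert_one({
    "tweet_id": "123456789",
    "lang": "en",
    "text": "Sample tweet text",
    "created_at": "2021-01-15T14:32:00Z",
})

# Retrieval is where the design debt shows up: without indexes tuned to the access
# pattern, queries scan large denormalized collections.
doc = tweets.find_one({"lang": "en"})
print(doc["text"] if doc else "no match")

# One common mitigation: add an index on the fields you actually query by.
tweets.create_index("lang")

The ease of the insert is exactly the trade-off discussed above: nothing stops malformed documents from landing in the collection, and retrieval performance depends on indexes and design decisions added after the fact.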

1.8 Data Modeling: Structured or Unstructured

It is estimated that 95% of newly created data is unstructured. In terms of existing data, 80% is already unstructured. This percentage continues to grow in favor of unstructured data. The evolution of data growth is portrayed in Fig. 1.8. Data modeling works well when the data is structured. As data grows and becomes more unstructured, data modeling becomes less obvious. To illustrate the differences, let us consider two examples: structured transactional data and unstructured textual data. Figure 1.9 is a simple example of transactional purchase data.

Fig. 1.8 Data growth and variation


Fig. 1.9 Transactional purchase data

Structured data makes it easy to identify the entities, records, and fields or attributes. It is well suited for relational databases, and for quick prototyping in Excel using a small, scaled-down data set. In the final production environment, to process large amounts of data and for data audit purposes, a database is required. The development of an Entity Relationship Diagram (ERD) is an integral part of any data modeling effort and of building a database or a data warehouse (aka data repository). It is like the blueprint of a house before it is built. Many other details (not within our scope) impact the building of a data warehouse, such as the nature of the data, ways to access the data both for internal and external consumption, performance, storage requirements, data licensing, access authorizations, credentials, and privileges. A common and relatively simple ERD is the star relational schema. The name derives from a star layout with a large core table at the center and smaller peripheral tables. A sample star schema ERD corresponding to the structured transactional data example is shown in Fig. 1.10 below.


Fig. 1.10 Star relational schema

relationship between fields or attributes. For example, a child’s age cannot be greater than his/her biological parent’s age. Another example in a different context is that the maturity date of a loan cannot be less than its origination date. In our transactional purchase example, a field integrity violation occurs when, for example, the data contains a region, such as Midwest, that does not correspond to any of the predefined valid values of north, east, west, or south. If the rule is set properly on the table, this will cause an error. A relationship integrity scenario is the case where, for example, the payment type is cash, and the payment source is online. Unstructured data, which represents almost 95% of newly created data, is composed mainly of text, audio, images, and video. Unlike structured data it is not easy to identify the records, entities, and attributes and how to fit within a traditional relational database. Figure 1.11 is an example of unstructured data representing a tweet. When looking at a tweet, it is hard to see any structure to the data and how to derive an ERD. The data can be stored in simple two columns Excel-like sheet composed of key-value pairs. A key is a unique identifier about the tweet, and value is the textual content of the tweet. There is however a limit as to how much data an Excel sheet can store, certainly not enough to deal with millions of tweets. Unlike a relational database, a database composed of key-value pairs is easy to set up. A good example is MongoDB (a NoSQL database). MongoDB is great for quick prototyping and for handling high volume data. It does however come at the risk of data quality degradation. To help mitigate any data quality issues it is recommended to construct

1.8 Data Modeling: Structured or Unstructured

17

Fig. 1.11 Anatomy of a tweet

Fig. 1.12 Tweet in JSON format

a data model. The derivation of an ERD, even for a non-relational database, enforces an understanding of the data, and helps define integrity violation rules that can greatly improve the quality of the data and prior to running the analytics. Jeopardizing quality in favor of speed will most likely result in fast junk-in and fast junk-out. Programmatically tweets are captured in JSON format which is a human-readable text commonly used to transmit and store data objects. JSON is a reminder of XML. The former is data oriented, whereas the latter is document oriented. Example of a tweet in JSON format is shown in Fig. 1.12.
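A small sketch of how the semi-structure in a tweet's JSON can be surfaced programmatically. The payload below is a simplified, made-up stand-in for the real Twitter format, which carries many more fields.

# Illustrative payload only; the real tweet JSON has many more attributes.
import json

raw = """
{
  "id_str": "987654321",
  "created_at": "Fri Jan 15 14:32:00 +0000 2021",
  "lang": "en",
  "source": "Twitter for iPhone",
  "retweeted": false,
  "text": "Markets opened higher this morning",
  "entities": {"urls": [{"expanded_url": "https://example.com/article"}]}
}
"""

tweet = json.loads(raw)

# Flatten the attributes we might model as columns in a relational-like table.
record = {
    "tweet_id": tweet["id_str"],
    "created_at": tweet["created_at"],
    "lang": tweet["lang"],
    "source": tweet["source"],
    "is_retweet": tweet["retweeted"],
    "links": [u["expanded_url"] for u in tweet["entities"]["urls"]],
}
print(record)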


A closer look at the JSON format reveals some structure to the data that can be associated with attributes like language, source, retweet, date and timestamp, links, etc. A tweet can, therefore, be treated as semi-structured data, and one can derive a data model to store the tweets in a relational-like database. As discussed earlier, the benefit of data modeling is improved data quality. Data modeling and understanding the relationships between the data attributes is a valuable exercise before conducting any analytics, to ensure a level of data integrity. Otherwise, we can easily end up with fast, large junk-in and junk-out data. This is true irrespective of the data format, and truer when the data is unstructured.
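As one possible way to express the integrity rules discussed in this section, the sketch below encodes the field-logic and relationship rules from the transactional purchase example as database constraints. SQLite is used only because it ships with Python; the table and column names are illustrative.

# Field-logic and relationship-integrity rules as database constraints (illustrative schema).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE region_dim (
    region_id INTEGER PRIMARY KEY,
    region    TEXT NOT NULL CHECK (region IN ('north', 'east', 'west', 'south'))
);
CREATE TABLE purchase_fact (
    purchase_id    INTEGER PRIMARY KEY,
    region_id      INTEGER NOT NULL REFERENCES region_dim(region_id),
    amount         REAL    NOT NULL CHECK (amount >= 0),  -- field logic integrity
    payment_type   TEXT    NOT NULL CHECK (payment_type IN ('cash', 'credit', 'debit')),
    payment_source TEXT    NOT NULL CHECK (payment_source IN ('in-store', 'online')),
    -- relationship integrity: a cash payment cannot originate online
    CHECK (NOT (payment_type = 'cash' AND payment_source = 'online'))
);
""")

conn.execute("INSERT INTO region_dim VALUES (1, 'north')")
conn.execute("INSERT INTO purchase_fact VALUES (1, 1, 19.99, 'credit', 'online')")  # loads fine

try:
    # 'Midwest' is not a predefined region, so the load is rejected.
    conn.execute("INSERT INTO region_dim VALUES (2, 'Midwest')")
except sqlite3.IntegrityError as err:
    print("Rejected:", err)

Loading "Midwest" into the region dimension, or a cash payment with an online source into the fact table, is rejected at load time, which is exactly when such violations are cheapest to catch.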

1.9 Much Information Little Intelligence

We are constantly bombarded with data and information. Big or more data does not necessarily translate into more intelligent information or more knowledge. On the contrary, more data many times implies more junk or noise. A business value proposition should be measured in terms of the intelligent data and actionable insights that can be extracted from the data, and not by the amount of data provided (Fig. 1.13). But how do we discern intelligent information from mere information, and how do we derive the actionable insights?

Fig. 1.13 Intelligent data versus big data

Much of the interaction with our surroundings can be described as passive. We are mostly passive consumers and passive transmitters of data and information. We pay little attention to the veracity of data and information. This is particularly true in how we engage and interact on social media. Through our interactions, and without being mindful, we become active participants in facilitating the propagation of noise, or the dissemination of fake news, and in magnifying the effect. Because of how social media platforms have been used, they became a de facto source of news and information for many, raising serious concerns about their business model and their role in spreading fake news. As a result, companies like Facebook, YouTube, and others have hired an army of fact checkers to help with the screening of information prior to wide dissemination, to identify the good from the bad and discern the true from the fake based on rules and criteria defined by the social platform. This is not a scalable solution, and it is subject to the interpretation of the fact checker or the organization's guidelines in determining what constitutes good, bad, true, or fake. There are no universally accepted rules, at least not yet, for such a determination.

The role of screening and fact checking should not be reserved just for the platforms or the data providers. By doing so, we enhance our passive role and diminish our relevance. We grant the social platforms an unjustified exclusivity that further undermines our judgment. Preserving the integrity of information is a shared responsibility. Every consumer and producer of data assumes an important role. Only then can we achieve some level of integrity when information is disseminated.

Big or more data does not necessarily translate into more information or more knowledge. On the contrary, more data implies more junk or noise. We all share a responsibility to improve the veracity of the information and data shared on social platforms.

There are a few basic guidelines to help define intelligent information and how to distinguish it from common and irrelevant information. To illustrate the differences, let us consider some simple examples based on different assertions.

Assertion 1: "The earth orbits around the sun."

For many this is a statement of the obvious. In other words, there is no new information or gained knowledge. Consequently, there are also no actions to take. Looking back to when Galileo defended the statement, circa 1633, not only was this new information, it was also controversial (contrary to Church beliefs), with dire consequences that led to Galileo's trial and arrest.


Assertion 2: "There is an 80% chance of rain tomorrow."

Given the high probability of occurrence, one immediate action is to carry an umbrella. There is knowledge and a clear actionable item in this case.

Assertion 3: "More people use cell phones in India than in Turkey."

Recognizing that the population of India is about 1.2 billion and that of Turkey about 80 million, it should come as no surprise that more people use cell phones in India than in Turkey. It is a statement of the obvious with no new knowledge, and little or no action to take.

Following the above three examples, we can define two criteria to help us assess potential intelligent information:

1. How much an event is expected or, put differently, how much we are stating the obvious or how much it is common knowledge.
2. What actions can be taken from that knowledge, or the knowledge to act.

The more an event is expected, or the more we state the obvious, the less potential there is for intelligent information. Stated differently, the less expected an event is, the more potential there is for intelligent information and actionable insights.

1.10 Measuring Information: Bits and Bytes

How do we measure or quantify intelligent information? To answer, we need to elaborate on what was discussed in the previous section. The observation that the more an event is expected, the less potential there is for intelligent information ties very well with an important topic in information theory called entropy. Stated differently, the more a piece of shared information is expected (the less surprising, the more obvious it is), the less new information or informational value it contains. Entropy, in simple terms, is a measure of uncertainty or unpredictability. The unit of measurement is the bit, as in digital binary coding. The definition of entropy, in relation to the probability of occurrence, is given by the formula below (the only formula in the book!):

H = -\sum_{i=1}^{n} P(x_i) \log_2 P(x_i)

H is the measure of entropy, P(x_i) is the probability of occurrence of a particular event x_i, and \sum_{i=1}^{n} denotes the sum over all n possible events. Here a base-2 logarithm is assumed, with bits that can take a binary value of 0 or 1. (A common logarithm assumes base 10, with digits that can take a value from 0 to 9.)

A simple use case is the tossing of a coin (Fig. 1.14). Let us compare the case of a fair coin (equal probability of occurrence for head or tail) to the case of a biased coin with a head (or a tail) on both sides. This is a binary event, as only two outcomes are possible: head or tail. The case of a fair coin is more interesting because of the equal probability (0.5, or a 50% chance) of getting a head or a tail. The case of a biased coin is less interesting since we know a priori what the outcome is: the probability is 1 (a 100% chance) of getting a head (or a tail). There is no uncertainty in this case, no guessing, and no different possibilities to consider. We can quantify this by calculating the entropy H (in bits) in each case.

Fig. 1.14 Binary event with two possible outcomes

In the case of a fair coin, we have:

H = -[0.5 \log_2(0.5) + 0.5 \log_2(0.5)] = 1 bit

For the biased coin we obtain:

H = -[1 \log_2(1)] = 0 bits

One bit (which can take a value of 0 or 1 in binary base 2) is enough to represent the two possible outcomes: head or tail. The biased coin has an entropy equal to zero bits. The less the uncertainty, the lower the entropy. This conforms with our previous discussion of information: the more the obvious is stated, the less information it carries. On a somewhat related topic, a popular slogan among financial traders is "sell [short] the news, buy [long] the rumors." In other words, once a story has made the news it becomes common knowledge, and there is less opportunity for profiting.

An important characteristic of entropy is that the more uneven the probability distribution is, the lower the entropy. Stated differently, given a set of possibilities, entropy is maximized when all possibilities or events have an equal chance of occurrence. This seems intuitive and is easily seen in the case of a fair coin with equal probability of occurrence for head and tail. In that case the entropy was found to be equal to 1 bit. In the case of an unfair coin with a 0.8 probability of occurrence for head and a probability of 0.2 for tail, the entropy can be calculated to be 0.722 < 1.

As we end the topic of entropy, and the units of bits and bytes, let us consider the following scenario: have you ever wondered how the size (in bytes) of a text file saved on your computer (Fig. 1.15) is determined? In the example, the size of the file test.txt is shown to be 2 KB. Assuming there is a total of 96 characters on the English keyboard, each with equal probability of occurrence, we can calculate the entropy of a single character as log2(96) ≈ 6.58 bits, or about 7 bits per character. Since 8 bits correspond to 1 byte, a crude approximation for one character is 1 byte. For a simple one-page text containing 400 words, where each word is on average 5 characters, the total comes to roughly 400 × 5 = 2000 bytes, or 2 KB. In other words, the size of such a file will be approximately 2 KB. The file test.txt in Fig. 1.15 contains 2000 characters, corresponding to about 2 KB.

Fig. 1.15 File size as stored on a computer

Bits and bytes represent a probabilistic measurement of possibilities. The higher the number, the higher the possibilities and the more potential for actionable insights.
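The entropy numbers above are easy to reproduce. The short Python sketch below reuses the same rough assumptions as the text: a 96-character keyboard and a 400-word page of 5-character words.

from math import log2

def entropy(probabilities):
    # Shannon entropy in bits: H = -sum(p * log2(p)), skipping zero-probability events.
    return -sum(p * log2(p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))   # fair coin -> 1.0 bit
print(entropy([1.0]))        # two-headed coin -> 0 bits (prints -0.0 due to floating-point sign)
print(entropy([0.8, 0.2]))   # biased coin -> ~0.722 bits

bits_per_char = log2(96)     # ~6.58 bits per character, roughly 1 byte
page_bytes = 400 * 5         # 400 words of 5 characters, at ~1 byte each
print(f"{bits_per_char:.2f} bits/char, ~{page_bytes} bytes, i.e. about 2 KB per page")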


Is a measurement in bits and bytes enough to quantify intelligent information? In practice bits and bytes are a good measurement for assessing storage space and data throughput requirements when budgeting for hardware purchases and network capacity. They do not, however, answer the question of how intelligent (in the sense of valuable and actionable insights) the information or the data is. Is 1 GB of data more valuable than 1 MB? Should 1 GB of data cost more than 1 MB? Many data providers and data brokers charge their customers based on bytes. It is a simple formula. It makes sense when talking about data storage and data streaming, much like how mobile phone or other streaming content providers charge. However, such metrics are misleading and misrepresentative when the real value is in the quality of the data and the integrity of the content itself, not in the size. With the advent of big data, where the level of noise or junk data is greatly amplified, integrity of the data becomes a prime concern and a big challenge. Bits and bytes are good measurements for data storage and network capacity. They are not a measurement of data quality, or of added business intelligence. Other factors and criteria need to be incorporated to better assess data integrity and intelligent information, one of which is the actionable insights or the knowledge to act.

1.11 The Competing Vs of Big Data

To better understand how to measure and assess the value of information we need to consider the big Vs of big data. Most recognize the four main Vs as Volume (as in quantity), Velocity (as in timeliness), Variety (as in data format, structured or unstructured), and Veracity (as in accuracy, quality, integrity). The one important and often omitted V is Value. By value we mean the actionable insights. The Vs do not work in harmony; rather, they compete. As the volume of data increases, velocity tends to decrease, and so does veracity. As the need to rapidly extract value increases, the veracity of data may decrease. There is always a tradeoff (Fig. 1.16). There is always a tradeoff in balancing the many Vs of big data. By shortening the time to extract Value from the data, we risk degrading the Veracity of the data. No matter how fast and good the analytics are, we want to avoid the fast junk-in and fast junk-out. In an ideal, and unattainable, scenario we want unlimited data (volume), in many forms (variety), available instantaneously (velocity), with perfect data accuracy (veracity), and of immediate value (value). This is equivalent to a singularity behavior. The term singularity, common in mathematics and physics, is used here to describe the


Fig. 1.16 The many competing Vs of big data

Fig. 1.17 Singularity condition

ideal situation where all Vs work in harmony to create a hypothetical, ideal condition. To illustrate the condition, let us consider the simple diagram in Fig. 1.17. The left diagram represents a normal condition where data veracity is limited (height of the shaded box), data volume is limited (area of the shaded box), and data velocity, measured by the timeliness (width of the shaded area), is also limited. The larger the data volume, the slower the timeliness (larger width), as it takes more time to move larger data. In the singularity condition, shown in the right diagram, we have an infinite amount of data (area of the infinitely narrow vertical bar), corresponding to infinite veracity (infinite height of the vertical bar), and infinite velocity corresponding to an infinitesimal timeliness (width of the vertical bar near zero), equivalent to instantaneous availability, and with an immediate value.


Fig. 1.18 Big data at high speed

Expressed in simple terms, the singularity condition is analogous to predicting the right answer to a question, or the correct response to an event, at the exact moment the question is framed or the event is about to occur. Although we are converging toward a singularity, it remains an ideal, hypothetical, unattainable condition. Imagine the following two ideal scenarios: firefighters arriving at the exact moment a fire erupts, or police arresting a suspect at the exact moment a crime is about to be committed. Although we are getting closer to the singularity, in practice we are constantly in search of the optimum solution subject to constraints imposed by the competing Vs. As data volume increases, data velocity and, most importantly, data veracity tend to decrease. Thanks to advances in technology, we are now able to transmit large volumes of data at high velocity (Fig. 1.18). Advances in technology are enabling us to stream big data, in a variety of formats and at higher speeds. At the same time, we are confronted with a rapid degradation in data quality and a growing challenge to quickly extract the valuable actionable insights. With the commercialization of big data, our biggest challenge is to ensure the veracity of the data, and to do it in almost real time. Here again something has to give: a time delay or slower data delivery for better data integrity, or a rapid consumption of data in almost real time for lesser data integrity and limited business insights. In the rush to secure a place in a highly competitive market, many in the industry have opted for the latter, only to be confronted later with new challenges and data quality issues. From the business perspective, without the V for Value and a clear determination of the actionable insights, there is no business proposition. In the introduction we


discussed the business value proposition for Twitter. The same can be argued for many other data feeds or social platforms. The realization that data alone is not sufficient, without the proper analytics to extract the actionable insights, started in the mid-2000s. The marketing sector was among the early adopters of brand analytics to leverage some of the data in social media. Companies like Salesforce began acquiring firms specialized in social media monitoring and analytics, like Radian6. This was followed by companies in finance, with big names like Bloomberg and Thomson Reuters acquiring smaller companies to leverage the analytics around social media data for sentiment analysis. In late 2011 Social Market Analytics (SMA) was co-founded, all as indicated in Fig. 1.19. The S-curve behavior is typical in industry, driven by early adoption to seize market opportunities and to remain competitive, until a leveling off is reached with full integration. The question for many businesses remains how to monetize their analytics, expand their market share, and gain and sustain the competitive edge. When acquiring a firm, there is an implied assumption that the knowledge is also acquired. In many cases the knowledge residing in the acquired firm is lost because of organizational changes and culture clash. As a result, the acquiring company is left with the tools and the infrastructure but without the know-how. Bigger is not always better! One example is IBM. Over the years IBM tried to reinvent itself from a PC maker into a more service-oriented company specialized in big data and analytics. To this aim the company acquired other smaller companies. By doing so it created a mix of tools and analytics that are hard to integrate, support, and maintain. Along the way it also tried to market its flagship product Watson to a wider audience. Many of these attempts failed to meet expectations.

1.12 The Competitive Edge

In tackling a problem, a scientist or a mathematician is often driven by a desire to find the correct or the perfect answer. Finding the exact solution (aka closed form) is not always possible, or it may be hard to derive. Many times, a better and faster alternative is to derive an approximate (aka asymptotic) solution. The ability to assess and determine the best approach and solution to a problem requires a variety of skills. Quite often the best answer to a problem is not the correct one, but rather the one that makes sense. In the financial industry and many other industries, including sports, healthcare, and law, it is not uncommon to find quants or data scientists with advanced degrees in mathematics, physics, or computer science. When "front desk" financial traders are confronted with a problem pricing a complex structured product, they turn to their in-house quants for help. By the time a quant has an answer (probably one best fitting a research paper), the trader has missed the market opportunity (the mark) to make the trade.

Fig. 1.19 Brand analytics early adopters


Fig. 1.20 Outrunning the competition

This is a typical example of where a timely solution that makes sense (and is probably less fitting for a research paper) is more rewarding to the business. Such decisions are best made with a good, combined assessment of the analytics, the time to delivery, and the business benefits. Quite often the best solution to a problem is not the correct answer, but rather the one that is good enough and makes the most sense. For a quant, it is said a project is never finished. For a financial trader, if the project is not finished yesterday, it is too late! The search for the perfect solution, if it exists, can be very costly for a business and offer little reward. The definition of a solution that "makes sense," or is "good enough," needs to be carefully weighed considering the competition, market entry time, and other market conditions. A mathematician appreciates the importance of correctly setting a problem, i.e. making the right hypotheses, identifying the proper constraints, and defining the initial and boundary (limiting) conditions. A missed detail or a small variation in the setting of a problem can easily make a problem unsolvable or more complex, or lead to drastically different results. Deriving a predictive model that is correct 80% or 90% of the time can be more challenging and more costly than deriving a model that is correct 50% or 60% of the time. As depicted in the diagram in Fig. 1.20, one does not need to outrun the bear; only to outrun the competition. No matter how good all the other four Vs are, there is little or no business value proposition without a good measure of the Value added to the business and of the gained competitive edge.


Big companies like Apple, Samsung, Amazon, and Walmart are in constant competition to maintain an edge. When movie streaming was starting to become popular, one differentiating factor was the load (buffering) time before actual streaming began. A difference in wait time on the order of a few minutes was enough to impact a customer's choice between Apple TV streaming and Amazon Fire TV. Predictive analytics played a key role in reducing the wait time by guessing the movie(s) a customer may want to watch and having them readily available. Amazon was able to better leverage its data and analytics for such purposes. When Amazon introduced the Prime service and the 1-hour delivery option in selected areas, it did so after an assessment of the competition and deciding on a solution good enough to gain the competitive edge and expand its market penetration. It clearly worked. The online video conferencing company Zoom, prior to COVID-19, wanted to be a simple and easy-to-install alternative to WebEx. With COVID-19 the company was well positioned to quickly leverage the opportunities. Others like Microsoft Teams and Google Meet joined the competition. Google determined that adding a noise cancellation feature to its Google Meet solution might help gain the competitive edge. The time to establish a market presence, and what incremental features to add, are part of the competitive game to stay ahead. The advent of big data makes it possible for a company to monitor the performance of its product in real time by listening to online chats and posted reviews. It can also monitor the competition. This is added valuable information that helps a company assess its competitiveness, identify the differentiating factors relevant to customers, and quickly adjust its marketing strategy and its product lines. Before tackling a problem, the business context must first be well framed. Without it, all subsequent efforts can lead to misleading or incorrect results, no matter how sophisticated the analytics are, and how advanced the technology is. The implications for a business can be very costly. A properly framed business problem must consider previous findings, a clear statement of scope, the targeted audience, differentiating factors, areas of competitive edge, key measurement metrics for success, an assessment of the expected added value to the business, and the potential increase in revenues. Such details are typically part of a business requirements document and are critical to obtaining the buy-in from the business stakeholders to fund the project. A typical process flow is best described by the three stages and the three main steps in each stage (3 × 3), as shown in Fig. 1.21. A good reference on the topic is Thomas Davenport's book "Keeping Up with the Quants," Harvard Business Review Press, 2013. Without a properly framed business problem, and the early engagement of stakeholders and business owners, any solution to a problem risks being a wasted and costly effort.


Fig. 1.21 Typical process flow


2 High Fidelity Data

When we hear high fidelity, we tend to think of audio quality (as in decibels) and image quality (as in pixels or high definition). In this chapter we will introduce a novel concept of high fidelity as applied to big data, how to achieve it, and how to measure it. The 24/7 streaming of data requires us to think differently in terms of systems, technology, and analytics. Some of the old methods applicable to more static data will not work anymore.

2.1 The Telephone Game: Data Sourcing and Transmission

As data travels from source to destination, it jumps through multiple stages (processing and transmission media). Just as water from a lake or sea can be contaminated during processing and delivery to residential homes, the processing and transmission of data are subject to potential contamination. A simple analogy is the old popular telephone game many of us enjoyed as kids, which is still demonstrated in some early childhood science projects (Fig. 2.1). In this example the data transmitted is of audio type or format, and the transmission medium is represented by the two aluminum cans and the string connecting them. The telephone game is clearly a poor method of communication, but it demonstrates how easily a transmitted message can be modified or get corrupted as it travels from a source to a destination. Another version of the game is the whisper game, in which the first player comes up with a message and whispers it to the second person in line. This is repeated until the last person in line, who in turn announces the message loudly to everyone, as shown in Fig. 2.2. The fun part of the whisper game is when the final message is read out loud and differs from the original message. This can be due to many reasons such as pronunciation difficulty, hearing problems, foreign language, understanding, etc. The higher the number of participants, the more dramatic the effect can be. Each participant is part of the transmission media and a potential source of contamination.


Fig. 2.1 The telephone game

Fig. 2.2 The whisper game

Although data transmission and communication are well beyond the telephone or the whisper game, some basic principles remain. At each stage of the transmission, the data is subject to potential contamination, corruption, and degradation. Data is stored in binary form represented by the digits 0 and 1. Unlike the telephone game, data is transferred through many transmission media such as storage devices, network cables, and fiber optics, or through the air via electromagnetic waves propagating at different frequencies or wavelengths such as Wi-Fi, Bluetooth, and more recently 5G technologies. Data can be encrypted, compressed and decompressed, dismantled and reassembled, downloaded and uploaded, jump across multiple networks, and be moved across different data servers before the information reaches its destination. Once at the destination, how can we confirm the data did not get corrupted during the transmission? As in the telephone or the whisper game, how do we know that the data received conforms to the data first published? To illustrate the point, let us consider a common real-life scenario, reminiscent of the telephone game, shown in the diagram below (Fig. 2.3).


Fig. 2.3 Operator communication

Fig. 2.4 Improved operator communication

Although the transmitted message is different from the original, there are certain simple steps that can be implemented to enhance the quality of the transmission and reduce the risks of the message getting corrupted. In the improved scenario (Fig. 2.4) we introduced checkpoints to verify the integrity of the data as it is transmitted from one stage to the next. But what kind of checkpoints? A simple option, to make the point, is to add the number count of characters in the message and append this information to the transmitted message. At each stage, the number of characters in the received message is counted again and compared to the transmitted count. If the counts are different, it is a sign that the message got corrupted, or characters were dropped. It is possible the number count itself gets corrupted during transmission. It is however a less likely possibility as the number count is a shorter message, and so is less prone to corruption. Another option is to transmit a key word referring to a theme or a context of the conversation. The


theme in our example is data. In some other context it could be fashion, movie, color, date, or weather related. The option of adding the number of characters to a message is analogous to a widely used concept in network engineering known as the "checksum." A checksum is used to confirm the integrity of a file transmitted over the network. It is basically a datum computed from the digital records (i.e., bits) being transmitted. If an error occurs during a file transmission, where a bit is dropped or a 1 becomes a 0, the checksum of the file at the destination will be different from the checksum of the file at the source. It is a powerful way to detect potential corruption of a file during transfer. Note that a checksum is a measure of transmission integrity and not a measure of content integrity. There are many ways to monitor and measure the integrity of the transmission of a message over a network. A common metric is the checksum, a value derived from the bits contained in a record. By comparing the checksum of the original record with the checksum at the destination, we can detect errors in the transmission. A typical real-world scenario of data contamination is illustrated in Fig. 2.5. In that scenario two types of data are sourced: a numerical value, and a numerical code representing established regional, state, or country codes. In the first case the numerical value is rounded (sometimes truncated) when loaded (ingested) into a central database to conform to the table's column format as defined in the database. The value stored in the database is later edited or corrected by an employee with access to the data. Finally, the edited value is consumed internally by an analyst. From the time the value was sourced to the time it is consumed, it underwent a few changes. If it is later determined that results published by the analyst are suspicious or erroneous, how do we isolate where the problem originated: was it a data issue, a calculation issue, or a storage issue? Only a careful audit would be able to reveal the source of the problem. The second case is trickier. The numerical code must be treated as a character string even though it looks numerical. The code in this case is one of the FIPS codes that uniquely identify counties and states in the USA. If the table column in the database is formatted to accept only numerical data, the leading zero in the code will be dropped. The newly obtained value in this case happens to also be a valid code, corresponding to the state of North Carolina (NC) instead of an LA county. It would be hard to detect the error in this case. Only when the numerical code is put in a wider context, in relation to some other data, would it be possible to detect the error.
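To make the checksum idea concrete, here is a minimal Python sketch comparing a toy byte-sum checksum with a cryptographic digest from the standard hashlib module; the sample messages are made up, and real network protocols use their own checksum algorithms.

```python
import hashlib

def simple_checksum(data: bytes) -> int:
    """Toy checksum: sum of all byte values, modulo 2**16."""
    return sum(data) % 65536

original = b"code=06037 value=12.3456"
received = b"code=6037 value=12.3456"   # leading zero lost in transit

# Different checksums signal that the received record differs from the source
print(simple_checksum(original), simple_checksum(received))

# A cryptographic digest (MD5 here) is a stronger fingerprint of the same idea
print(hashlib.md5(original).hexdigest() == hashlib.md5(received).hexdigest())  # False
```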


Fig. 2.5 Typical data contamination


2.2 From Audiophile to Dataphile

We get what we pay for. This is especially true when shopping for audio equipment, such as headphones. The price can vary from less than $10 to thousands of dollars. Audiophiles are enthusiastic about high-fidelity sound reproduction, and willing to pay the premium price tag for better sound quality. Likewise, dataphiles can be defined as those enthusiastic about high fidelity data. Dataphiles are concerned about the quality of the data and about preserving the integrity of the data during transmission. Like audiophiles willing to pay a premium price for high fidelity (HF) audio quality, dataphiles are concerned about the quality of data, and willing to pay the premium price for high fidelity (HF) data. Some of the engineering concepts in HF audio transmission such as reliability, correction, filtering, contamination, and interference also apply to HF data transmission.

2.3 Interference and Data Contamination (Signal-to-Noise)

The more data travels, the greater the risk of the data getting corrupted. As data also gets bigger, there is a higher chance of data contamination. The relation between data size and noise (i.e., contamination or unwanted data) is illustrated in Fig. 2.6. As data gets bigger the quality of the data degrades. One way to quantify the quality of data is to measure what is called the signal-to-noise ratio (SNR). The bigger the data size, the lower the Signal-to-Noise Ratio (SNR), and the harder it is to extract the signal from the data. Signal here is intended to represent intelligent and valuable information. Noise is about erroneous, junk, or undesired data. Signal-to-noise applies to many areas and dates back to early topics in communication engineering on how to measure and preserve the quality of transmission, or propagation, as data or other types of signals (i.e., audio, video, text, power, energy) travel from one location to another. In finance it is analogous to the ratio of return over risk, also known as the Sharpe ratio. The concept of noise has gained relevance as more streaming data has become available. In its simple form, SNR is the ratio of the data mean or average (in finance known as the expected return) over the data variability (in finance known as risk or volatility) measured by the standard deviation (STD) of the data. Another variation on SNR is the Coefficient of Variation (CV), which is the inverse of SNR, or the standard deviation over the mean.
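As a minimal sketch of the SNR and CV definitions above, using Python's standard statistics module and made-up sample values:

```python
import statistics

def snr(values):
    """Signal-to-noise ratio: mean over standard deviation (population std used here)."""
    return statistics.mean(values) / statistics.pstdev(values)

def coefficient_of_variation(values):
    """Coefficient of variation: the inverse of SNR (std over mean)."""
    return 1.0 / snr(values)

quiet = [10.1, 9.9, 10.0, 10.2, 9.8]   # same mean, low variability -> high SNR
noisy = [4.0, 16.0, 7.0, 13.0, 10.0]   # same mean, high variability -> low SNR

print(round(snr(quiet), 2), round(snr(noisy), 2))
print(round(coefficient_of_variation(quiet), 3))
```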

Fig. 2.6 Signal-to-noise ratio (signal plotted against data size over a range of quality from low to high: smaller data with a higher signal-to-noise ratio versus bigger data where the signal becomes harder to filter or separate)


Fig. 2.7 SNR for normal distributions (normal distribution curves with mean μ and variance σ²: μ=0, σ²=0.2; μ=0, σ²=1.0; μ=0, σ²=5.0; μ=−2, σ²=0.5; narrower curves correspond to a higher SNR and wider curves to a lower SNR)

In practice we want to increase the SNR. This can be achieved in two ways. One way is to keep the mean in the numerator constant and lower the STD in the denominator. The second option is to keep the STD in the denominator constant and increase the mean in the numerator. When data is normally distributed (Fig. 2.7), a visual inspection of the curve can quickly reveal which distribution has a higher or lower SNR. For the same mean, the wider the curve, the higher the STD and the lower the SNR. Such behavior is in a way intuitive. The higher the STD (data variability), the higher the volatility of the data and subsequently the higher the risks. Conversely, the lower the variation of the data, the lower the volatility and the lower the risks. The two limiting cases, a zero signal with finite noise and a finite signal with infinite noise, are similar. Both lead to a zero signal-to-noise ratio. Streaming big data is more about the latter case. The challenge is to extract the limited finite signal from an almost unlimited infinite noise.


Fig. 2.8 Noise categories

While the focus is mostly on the signal, it is equally important to also understand and analyze noise. Not all noise is bad. Noise can be described in different ways and is not just about network interference or the loss of bits and bytes. Missing data, duplicate data, manipulated data, and mis-calculated data are a few examples of what can be treated as noise. In a broader definition noise is the unwanted, the uncommon, or the undesired. Noise is not always bad and can sometimes reveal valuable information about the data. In practice noise can be separated into three categories, as represented in Fig. 2.8. Each category is handled differently. Bad noise is data that has been determined to be incorrect, missing, or duplicate. It is relatively easy to detect bad noise given well-established rules and formulas. Fake noise is data that has been maliciously injected or manipulated for the purpose of gaming the system or misleading. Fake noise is much harder to detect and requires a longer time to identify. Good noise can provide some valuable information, which may help, for example, with training an AI (artificial intelligence) system or improving a predictive algorithm. It is important to note that AI and robotics, as with humans, learn as much from failed attempts and bad predictions as from successful ones. Many times, a data outlier is treated as bad or unwanted. An outlier is an uncommon behavior but is not necessarily bad or undesired. An outlier can be a legitimate value and indicative of an interesting case. Only an in-depth analysis can reveal the nature of the outlier. Of all three categories the dominant ones are bad and fake noise. When working with data it is crucial not to disregard any noise in the data until a determination has been made as to the type of the noise: good, bad, or fake. The common practice, represented by A in Fig. 2.9, is to ignore bad data. Fake data is hard to identify upfront. There are however increasingly new techniques, partially enabled by AI, to help detect what can be labeled as fake data. The more


Fig. 2.9 Handling bad data

advanced approach, shown in B, is to capture the bad data in separate containers (error log files) for further examination and categorization. This extra work can be costly for a business but is very rewarding in the long term. It is part of fostering a culture of dataphiles and implementing the technical solutions to achieve high fidelity data. Some of the techniques are described in the next section.
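Before moving on, here is a minimal Python sketch of the approach labeled B: instead of silently discarding records that violate integrity rules, route them to a separate error-log container for later examination. The field names and rules are hypothetical.

```python
def ingest(records, required_fields=("id", "value")):
    """Split incoming records into clean rows and an error log instead of dropping bad rows."""
    clean, error_log = [], []
    seen_ids = set()
    for rec in records:
        if any(rec.get(f) is None for f in required_fields):
            error_log.append({"record": rec, "reason": "missing required field"})
        elif rec["id"] in seen_ids:
            error_log.append({"record": rec, "reason": "duplicate id"})
        else:
            seen_ids.add(rec["id"])
            clean.append(rec)
    return clean, error_log

rows = [{"id": 1, "value": 3.2}, {"id": 1, "value": 3.2}, {"id": 2, "value": None}]
good, errors = ingest(rows)
print(len(good), len(errors))   # 1 clean row, 2 rows preserved in the error log
```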

2.4 Monitoring, Detecting, Resolving, and Reporting

With 24/7 streaming data, it is more essential, and more challenging, to check the integrity of the data. We need the tools for monitoring, detecting, resolving, and reporting. Many factors can contribute to poor data quality, such as network transmission, software programs, hardware storage, or human-introduced manual errors. Some examples are:
– Network packets dropped over wire transmission, like characters being dropped from a message.
– System errors introduced by rounding or truncating a number.
– Software errors when uploading or downloading large data sets.
– Archiving and unarchiving large files.
– Compressing and decompressing messages.
– Technology platform changes or version upgrades.


Various tools can help with the early detection and the mitigation of data quality degradation risks. Investing in such solutions at an early stage helps avoid the fast garbage-in and fast garbage-out at a later stage.

Monitoring

Monitoring is about developing and incorporating the tools to collect the metrics that help determine the "health" status of the data as it is transferred. Monitors act like checkpoints as data moves from one stage to another. Monitoring can be invasive and may interfere with the flow of data and slow it down. It is a cost worth paying in exchange for improved data quality, and for achieving high fidelity data. Metrics is not just about measuring. It is also about defining the right measures. For example, at each checkpoint we can calculate a summary statistic representing the footprint of the data, such as the number of rows, the number of columns, and missing entries. Such statistics are not computationally costly and have little impact on the flow of the data. Any difference in the footprint of the data from one checkpoint to another can be a sign of a problem. Monitoring begins when the data is first sourced. The source can be a database, a spreadsheet, or other data sources like files consumed via a file transfer protocol (FTP), or sometimes live data feeds via different channels of delivery. It can also be data from mobile devices, such as GPS locations. Once sourced, the data is processed and usually stored in a data warehouse for users to access. Users can be internal staff and/or external paying clients. A typical scenario, using an ETL (Extract-Transform-Load) process, is illustrated in Fig. 2.10. Many commercial ETL tools are available for that purpose. A detail often missed in an ETL process, or similar, is a separate "error logs" database. Such a database serves the purpose of storing data rejected by ETL due to data integrity violations. An error logs database greatly improves the quality of the data by filtering out potential data errors that can be analyzed later. In the age of big data, any ETL-like data processing pipeline must consider some of the challenges associated with streaming data and the pressing need for a quick resolution when an error in the data is detected. As described in our earlier paradigm shift, it is like conducting a surgical operation on a patient in a speeding ambulance. Monitoring is about developing and incorporating the tools to measure the health status of the data as it travels. Metrics is not just about measuring. It is also about defining the right measures. Implementing an ETL solution for streaming data introduces undesired overheads in terms of speed of processing, and the rapid availability of the data. This has led to a variety of mitigating solutions such as distributed processing to speed up the process of ETL. Other solutions have opted for a different approach, by loading


Fig. 2.10 Data ingestion (i.e., ETL) and error logs

the data first to speed up the availability of the data. Such a solution pushes the burden to transform and extract the data further downstream, and sometimes to the consumer. It can be thought of as a reversed ETL process, or LTE where the data is loaded first and later transformed and extracted. It is a choice that comes at a cost: rapid data availability (timeliness, velocity) for less data integrity (veracity). For some the preferred choice is to obtain the data as is, and as quickly as possible. It is a choice that makes sense for a business that relies on its in-house expertise and developed tools to further process the data and resolve data integrity issues. Several hedge funds fall in that category.
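To make the checkpoint "footprint" idea from the Monitoring discussion concrete, here is a minimal Python sketch (using pandas) that computes a lightweight summary at two stages and flags a mismatch; the column names and values are made up for illustration.

```python
import pandas as pd

def footprint(df: pd.DataFrame) -> dict:
    """Lightweight data footprint: cheap to compute, easy to compare across checkpoints."""
    return {"rows": len(df), "columns": df.shape[1], "missing": int(df.isna().sum().sum())}

source = pd.DataFrame({"fips": ["06037", "37001"], "rate": [3.25, 4.10]})
loaded = pd.DataFrame({"fips": ["6037", "37001"], "rate": [3.25, None]})  # value lost on ingestion

fp_source, fp_loaded = footprint(source), footprint(loaded)
if fp_source != fp_loaded:
    print("footprint mismatch:", fp_source, "vs", fp_loaded)   # flag for investigation

# Note: the footprint catches the missing value, but not the silently truncated FIPS code,
# which is exactly why content-level checks are also needed (as discussed earlier).
```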

Detecting

Detecting is about determining if there is an unusual event and identifying probable causes. With good monitoring tools in place, detecting becomes easier. For good detection we need the right metrics, and the proper rules or criteria to assess if an error or an unusual event has occurred. Detecting can assess the nature of the error and its severity, and help with attribution to isolate possible causes. There can be multiple reasons why, for example, a data footprint changes when data is moved around. They range from vendor problems (data), access issues (network), software bugs (code), operator mistakes (ops), database issues (DB admin), and schema issues (DB design) to conversion errors (code). Defining the right metrics is key to good detection. Earlier we discussed the checksum used to check the integrity of a file transmission (in binary format) over a network.


Fig. 2.11 CUSIP code and check digit

The concept of a checksum is used in many other applications. A good example in finance is the CUSIP code. CUSIP is a nine-character numeric or alphanumeric code used to uniquely identify securities such as US Government bonds and notes. The ninth digit is a "check digit" automatically generated (Fig. 2.11) based on a formula using the first eight characters. Its sole purpose is to check the integrity of the CUSIP. If during a transmission or a manual entry a digit is erroneously changed or entered, the check digit is used to detect the consistency violation. Quite often in practice, the ninth digit is dropped or not captured for reasons such as reduced storage requirements, lack of internal quality checks, or lack of domain expertise. The implications can be dramatic. A mistyped or mis-transmitted CUSIP code, for example, will result in the incorrect bond selection and can lead to trading losses measured in the millions of dollars. Other metrics that help identify data errors include missing data fields, duplicate data, or logically inconsistent data such as a negative value. In assessing the health status of the data, each metric can be assigned a weight factor subject to its relevance and severity of impact. In doing so, we must, for example, differentiate between a must-have data field or attribute (like the interest rate of a mortgage loan) and an optional nice-to-have data field (like the loan applicant's height or weight). Missing optional data has less of an impact on the data integrity and should be assigned a lesser weight factor than a must-have field. All weight factors can then be aggregated to derive a normalized scoring number (on a scale of 0 to 1 or 0 to 100). A score of 1 (or 100) represents a perfect data quality score. A monitoring dashboard with drill-down capabilities can greatly help with the quick detection and isolation of a data quality issue in almost real time. Detecting is about determining when a problem or an unusual event has occurred, assessing the severity, and identifying potential causes.
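For illustration, here is a sketch of the widely documented CUSIP check-digit calculation (the modulus-10 "double-add-double" scheme); treat it as an illustrative implementation rather than an authoritative one.

```python
def cusip_check_digit(base8: str) -> int:
    """Compute the ninth (check) digit of a CUSIP from its first eight characters."""
    total = 0
    for i, ch in enumerate(base8.upper()):
        if ch.isdigit():
            v = int(ch)
        elif ch.isalpha():
            v = ord(ch) - ord("A") + 10          # A=10, B=11, ..., Z=35
        else:
            v = {"*": 36, "@": 37, "#": 38}[ch]  # special characters
        if i % 2 == 1:                           # double every second character
            v *= 2
        total += v // 10 + v % 10                # add the digits of the (possibly doubled) value
    return (10 - total % 10) % 10

# Example: Apple common stock's CUSIP is 037833100; the check digit of "03783310" is 0
print(cusip_check_digit("03783310"))   # 0
```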

Resolving

Resolving is about what remedial action to take when a data problem is detected. Any resolution must differentiate between the three types of data motion: the relatively


Fig. 2.12 Repairing bricks in a wall

static or slow-moving data, the near real-time or quasi-fast streaming data, and the real-time fast-moving data. With 24/7 streaming data we do not have the luxury to stop the assembly line and make a correction. Speed of execution is of the essence. To illustrate the challenge, let us consider the illustrative scenario where a small data set, call it a data brick (like a brick in a wall), was found to be defective, i.e. the data is bad. How do we correct the data? What if the data has already been published and sourced? Not an easy task. The analogy of a brick in a wall, where each brick represents a small data set, is shown in Fig. 2.12. To repair a wall there are two possibilities, depending on the severity.
– Replace the few bad bricks. This is equivalent to replacing the few affected data points or the small data sets. This scenario assumes the impact of the change is localized, and there are no referential dependencies. By referential dependency we mean changing a data set that will impact other data sets. In such a case, we need to correct all affected data.
– The second scenario is to replace more than the few damaged bricks, perhaps the entire wall. This is the case if the bricks are, for example, part of a structural foundation. This is equivalent to data residing in a core table with many referential dependencies. In this case, we may need to reconstruct or repopulate the entire database.


Correcting an error, when data is streaming or is already published, requires quick detection and rapid resolution. The longer it takes to resolve, the harder it is to mitigate the impact. For quick detection and rapid resolution, we need new methods of assessment, and new tools for intervention. A quick assessment of impact should consider questions like: is the problem localized or not? How long has it been since the problem was first detected? The more localized the problem and the shorter the elapsed time, the more effective the intervention. Some of the concepts and ways to correct a streaming data problem find their roots in advanced signal processing. For example, we can cancel or alter a broadcast signal by sending a cancellation signal or by replacing it with a new signal. So how does this relate to streaming data? Assume a portion of data already published has been determined to be erroneous. We can stream a cancelling data set (via a message, a code instruction) or replace it with a new data set. It may seem complicated, but the concept is already applied in many circumstances. A common example is a message published via email or other messaging apps like WhatsApp. If the message is sitting in the pipeline, or in the user's inbox, and has not been read yet, it is possible to revoke the message or replace it with a new one. Usually there is a very small time window of opportunity, before the information is consumed, when a correction is still possible. Hence quick detection and rapid intervention are key. Corrections can happen programmatically and rapidly without manual intervention. The system can learn and correct itself the moment it is aware of a data problem. Such smart systems can be implemented with the help of machine learning and artificial intelligence. Alternative ways are hybrid solutions combining machines and humans. Machines assign an impact factor based on the severity of a problem and suggest a remedy. An expert can review the suggested remedy and execute it. This allows for the efficient leverage of resources and timely intervention. Any data correction must be executed according to the organization's internal rules for data governance and audit. In case of changes to the database, tools are available to control and record any changes to a database (both data and table schema) using database versioning tools such as Microsoft Team Foundation Server. Like source code control and versioning tools (e.g., CVS, SVN, Git), database versioning tools allow for the create, update, restore, revert, merge, and versioning of databases already populated with data. Versioning a table means maintaining a historical record of both the table schema and the data in the tables, as illustrated in Fig. 2.13. Depending on the complexity of the table schema and the amount of data, database versioning carries the risk of data loss or other undesired results, which may not be immediately obvious and may not be easy to recover from later.
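To illustrate the cancel/replace idea for already-streamed data, here is a minimal sketch of a consumer replaying a stream in which later correction messages supersede earlier records; the message format and field names are hypothetical and not those of any particular messaging system.

```python
def apply_stream(messages):
    """Replay a stream where later 'replace' or 'cancel' messages supersede earlier records."""
    state = {}
    for msg in messages:
        key, action = msg["id"], msg["action"]
        if action in ("publish", "replace"):
            state[key] = msg["value"]
        elif action == "cancel":
            state.pop(key, None)        # revoke a previously published record
    return state

stream = [
    {"id": "px-101", "action": "publish", "value": 99.5},
    {"id": "px-102", "action": "publish", "value": 42.0},
    {"id": "px-101", "action": "replace", "value": 95.5},   # correction to an earlier record
    {"id": "px-102", "action": "cancel", "value": None},    # cancellation of a bad record
]
print(apply_stream(stream))   # {'px-101': 95.5}
```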


Fig. 2.13 Database versioning

Monitoring is measuring the symptoms, like the blood pressure or the heartbeat of a patient. Detecting is diagnosing a symptom, such as high blood pressure, and determining a probable cause. Resolving is about prescribing a remedy, like a medicine to reduce the blood pressure.

Reporting

Reporting is about the actionable insights, emphasizing the differentiating factors and the changing conditions. This is the last and most critical stage. Upper-level management, C-level executives, and stakeholders depend on reporting for key decision making. The difference between good reporting and bad reporting can mean the loss of business opportunities, and sometimes the loss of lives (as Edward Tufte argued in his analyses of the space shuttle disasters). In the age where big data drives much of the business decisions, it is recommended to incorporate the activities of monitoring, detecting, resolving, and reporting in one team dedicated to the overall Data Quality and Metrics (DQM). This is a role different from the traditional Quality Assessment (QA). QA is more concerned with executing predefined processes and working with established internal guidelines for reporting. It is best suited for business intelligence. A DQM-like team can quickly adapt, develop new tools, set new metrics, and assess the quality of the data in real time from inception to delivery. DQM is a better fit for business analytics. Reporting is less about stating the status quo, and more about highlighting the differential, the unusual, and the competitive edge. A familiar measure of differential


is the standardized score, also called the z-score. A z-score measures the deviation from the mean in units of standard deviation (assuming a normal distribution). It enables the comparison of different sets of results, on an equal basis, like SAT and ACT scores. The concept of the z-score can be applied in many circumstances (assuming a near normal distribution) to detect unusual activity. A withdrawal from a bank account can be flagged as unusual if the corresponding z-score exceeds a limit based on a rolling time window of historical withdrawals. We can measure a change from the norm (the typical, the baseline) in a wider context by introducing a new metric called the differential d-factor. The concept can be applied to different situations including human behavior. For every individual we can define a normal state. Big data allows us to migrate from the group definition of normality (typical) to the individual definition of normality. Each individual's normal state is different. Some may have a high level of frustration tolerance and some a low one. In marketing, more and more is promoted at the granular, individual level. Programmatically it is easier to identify similarities than differences. Many social platforms such as Facebook and LinkedIn suggest networking with similar others. Other online services such as Amazon and Netflix offer suggestions based on an individual's previous views and purchases. They tend to reinforce the norm or the status quo. Sometimes the valuable insights, and the business opportunities, come from identifying and measuring the d-factor. Consider the case of online professional networking. Most of us tend to professionally connect with people who share similar skills, similar interests, and comparable credentials. It is another example of the social platform reinforcing our norm. When looking for a new job, the chances of landing an opportunity among those who share similar skills and similar profiles are not necessarily better. People in our network are probably interested in similar job opportunities. The odds of finding a job can be higher outside our own network, where there is less potential for competition. In any reporting what is most interesting is the unusual observation, and the results that deviate from the norm. The norm is defined as representative of the status quo or the baseline under given conditions. Any deviation from the norm can be measured by introducing a new, more general measure called the differential factor or d-factor. To illustrate how the differential d-factor can be put into practice, let us consider a real-world example based on my own experience and within the scope of data quality and metrics. We start by defining the footprint (normal state) of the data at inception. A normal state does not necessarily imply a perfect state. As data is moved around, any deviation from the normal state is measured by the d-factor. Based on the type and magnitude of the d-factor, interventions for remedy can be prioritized. Assume, for example, the data set is composed of one million rows and 100 field columns. This corresponds to a total of 100 million entries. A data footprint will be

Fig. 2.14 Data quality meter (data quality scoresheet reported on a 0-100 scale, with categories such as Good, Hole, Others, Partially Filled, and Rules Violation)

based on a set of metrics calculated from the 100 million entries. Metrics can consider details such as how many missing entries, how many duplicates, how many must-have attributes, how many nice-to-have optional attributes, and how many business logic field integrity and field relationship integrity violations there are. Defining the metrics for a footprint requires a good understanding of the data and the business. As data is transferred, the data footprint is recalculated at each established checkpoint. Say, for example, it was determined the recalculated footprint has 100 missing entries. This corresponds to 100 missing out of 100 million entries, or a d-factor equal to 1 PPM (part per million). A PPM is a generic unit of measurement. A similar PPM can be calculated for all other metrics defining the footprint, such as duplicates and business logic rules violations. As data travels, a set of metrics representing the data footprint is re-calculated to determine the d-factors. Depending on the type and severity level of each d-factor, a more efficient and timely intervention can be implemented, combining humans and automation. One good way to report data quality metrics is a scoresheet. A scoresheet is a quick visual assessment, sort of a temperature check of the data health status. It is designed to be simple to interpret and informative. An example of a scoresheet is shown in Fig. 2.14. The score here is reported on a scale from 0 to 100, with 100 being perfect data quality. The temperature scale color coding of green and red is selected to be intuitive and representative of good and bad, respectively. The targeted audience in many cases is upper management and C-levels. By incorporating smart and frequent automated reporting, we can foster a culture conducive to high fidelity data. A separate detailed report with drill-down capabilities allows for more in-depth investigation and attribution.
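As a rough illustration of the footprint, d-factor, and scoring ideas above, here is a minimal Python sketch; the metric names, weights, and the way they are aggregated into a 0-100 score are illustrative assumptions, not a prescribed formula.

```python
def d_factors(total_entries, deviations):
    """Deviations per metric expressed in parts per million (PPM) of all entries."""
    return {metric: count / total_entries * 1_000_000 for metric, count in deviations.items()}

def quality_score(factors, weights):
    """Aggregate weighted PPM deviations into a 0-100 score (100 = perfect)."""
    penalty_ppm = sum(factors[m] * weights.get(m, 1.0) for m in factors)
    return max(0.0, 100.0 * (1 - penalty_ppm / 1_000_000))

total = 100_000_000                      # 1 million rows x 100 columns
observed = {"missing": 100, "duplicates": 20, "rule_violations": 5}
weights = {"missing": 1.0, "duplicates": 0.5, "rule_violations": 5.0}   # must-have issues weigh more

factors = d_factors(total, observed)
print(factors["missing"])                # 1.0 PPM, as in the example above
print(round(quality_score(factors, weights), 6))
```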

3 Connecting the Dots

3.1 The Internet of Things (IoT)

With the advent of big data, the biggest competitive edge is the aggregation of data from multiple sources. In this chapter we will explore some important concepts and applications of connecting the dots. IoT is often used to describe physical objects with embedded technology (i.e., sensors) that allows for interaction or two-way communication with their surroundings. Consider the smart fridge that reminds us when we are low on milk. With two-way communication we can inquire if there is milk in the fridge. Imagine now multiple physical objects with similar embedded technology such as the car, the TV, the washer, and others, as portrayed in Fig. 3.1. The hardware and technical details of smart sensors are outside our scope. Our interest in IoT is how to leverage the streaming data collected from many sensing devices, and to do it in almost real time. Let us consider a simple example. Waze is a popular application, based on crowdsourcing, where subscribers report traffic-related events such as accidents, congestion, and police presence. It has gained wide adoption since it was first introduced years ago. On its own Waze provides valuable information to a commuter. However, if the information from Waze is associated with data from other sources, higher-value information can be obtained. A commuter delayed because of traffic congestion can have an alert automatically sent to a contact or a meeting organizer. In this case the data from Waze and the GPS location are aggregated with data from a calendar to generate the alert.

3.2 Data Aggregation

By data aggregation we mean bringing data from different sources together to create an aggregated or unified database. Data sitting in a silo provides a limited level of information. A higher value is obtained by aggregating the information from


Fig. 3.1 The Internet of Things (IoT)

different sources. Building an aggregated database can be a challenging task. It is however very rewarding in identifying new actionable insights and providing a competitive edge. It is an example of where the whole is greater than the sum of its parts, or, put simply, where the sum 1 + 1 is greater than 2. As much as IoT has contributed to the advent of big data by generating new data, it has also enhanced the importance of connecting the dots, or the data from different sources, called data aggregation or data unification. It is where 1 + 1 > 2! Connecting the dots reminds us of the game of dots to complete a picture, as demonstrated in Fig. 3.2. Which dot to start from? Are there missing dots? Any bad dots? Which way is faster? With big data and disparate sources, the task of connecting the dots may not be easy, but the rewards justify the costs. Rewards can be in increased revenues, better customer retention, improved national security, and better health monitoring. A real-life example depicting the challenges associated with data aggregation, even in simple cases, is shown in Fig. 3.3. Given the three types of ID cards (a driver's license, a social security card, and a health insurance card), it is hard to determine if they all belong to the same individual. There is no common unique identifier. Each ID card has its own unique identification number. The only way to connect all three IDs to the same person is by referencing the name. The name can however be spelled or recorded differently, as in Jane Doe versus John Doe. One recourse to solve this problem is to use fuzzy logic (part of fuzzy mathematics) and determine with some level of confidence how probable it is that the three names belong to the same person. With fuzzy logic we do not need to be 100% correct. An 80% or sometimes 60% accuracy can make a big difference and provide the competitive edge. After 9/11 many logistical travel changes were introduced to address security concerns and the early detection of potential threats. Identifying and tracking suspected terrorists, before an act is committed, became a prime concern.
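Returning to the ID-card example, here is a minimal sketch of fuzzy name matching using Python's standard difflib module; the 0.8 similarity threshold is an arbitrary illustrative choice, and production record-linkage systems typically use richer techniques.

```python
from difflib import SequenceMatcher

def name_similarity(a: str, b: str) -> float:
    """Rough similarity score between two names, between 0 and 1."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

reference = "Jane Doe"
candidates = ["Jane A. Doe", "JANE DOE", "John Doe"]

for name in candidates:
    score = name_similarity(reference, name)
    same_person = score >= 0.8          # illustrative confidence threshold
    print(f"{name!r}: similarity={score:.2f}, likely same person: {same_person}")
```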


Fig. 3.2 Connecting the dots

Fig. 3.3 Disjointed identification cards

New guidelines were introduced to screen travelers coming to the USA based on their ethnic background, place of travel, name, and religion. Some of these guidelines were later dismissed, not just for being discriminatory but also for their inefficacy. Let us consider the typical scenario in Fig. 3.4.


Fig. 3.4 Identifying and tracking


Fig. 3.5 Aggregated or unified database

Based on early 9/11 aftermath guidelines, a white person named John travelling from France to the USA would probably not raise suspicions. Instead, a darker-skinned person named Ahmed travelling from Cairo to the USA might trigger an alert. The person's activities, like renting an apartment, getting a speeding ticket, or making international calls, are all recorded in separate databases. Each database provides limited information when analyzed in isolation. By aggregating the data from all sources, a bigger picture emerges that can better assess potential security threats. In our example, there are about ten different databases, each with its own unique identifiers for recording a transaction or an event. Each database is owned by a different organization or agency. The logistics, the technology, and the analytics needed to connect all the databases, and to link the transactions to the same individual, are not straightforward programmatically or logistically. In the USA, a company called Palantir claims to have the edge in developing the tools for aggregating data. Palantir was referred to as the most valuable, best-kept-secret company in Silicon Valley; not anymore, since it went public in 2020. The company gained prominence after 9/11, signing big contracts with US government defense agencies. Palantir was credited with using big data and connecting the dots to help locate Bin Laden. It has since been working on migrating its know-how to wider industries like finance. An illustration of an aggregated database is shown in Fig. 3.5. By combining data from different sources, we can build a more comprehensive picture of a person's profile, parts of which may be considered private (a topic we will address later). Globally, China is a major player in creating unified databases and social scoring, partially due to the lack of enforced privacy rules. A final example of where connecting the dots can be rewarding is government money wasted on bad payments. Bad payments can range from fraud to checks issued to the wrong person, including dead people, or for the wrong amount because of a typo. The amount of wasted money in 2009 was estimated to be $110 billion. In 2010 President Obama issued a directive to stop all bad payments by ordering the creation of a federal "Do-Not-Pay List," a database with data aggregated from different sources. Every government agency would have to search the aggregated database before issuing a check to individuals or contractors. Since the beginning of the COVID-19 pandemic in March 2020, and according to the US Labor Department, it is estimated that close to $63 billion has been paid improperly through fraud. Clearly, we still have a long way to go to connect all the dots.


In a digitized universe every event and every performed transaction generates a digital record that resides in the cloud, i.e. a database or a data repository. The format of the digital record can vary among textual, visual, alphanumeric code, numeric code, bar code, facial, audio, video, biometrics, fingerprints, etc. The pursuit of an aggregated database has become the secret recipe for extracting the valuable insights, and for gaining the competitive advantage.

3.3 The Golden Copy

In almost all cases of data aggregation, the lack of a common unique identifier makes the problem hard to tackle. Earlier we presented the common case of three types of ID cards. There is no shortage of examples and potential applications. A less common case, but very useful when analyzing financial data, is described below. Different types of financial instruments or securities (equities, fixed income, commodities, foreign exchange, derivatives, etc.) are traded in the financial market. Some are listed on the major exchanges (e.g., NYSE, NASDAQ, LSE), while others are traded Over-the-Counter (OTC). For each security there are four key descriptive characteristics: the company that owns the security, the name and type of the security, the country of origination, and the exchange where the security is listed and its price quoted. Some may be listed on multiple exchanges (dual listed). A trader, a portfolio manager, an analyst, or a researcher is often interested in questions of the sort "how many securities, and of what type, has a company issued or does it own?," "on which exchange(s) is a security listed?," "in which country is the exchange located?" The ability to process such requests in a timely fashion requires a database architecture capable of aggregating the data from different tables in the database. In a silo, each table provides limited information, like how many securities of a particular type there are. A good analogy is answering a survey question of how many children a family has. The relatively harder question is how many children share the same biological parent(s). A security can be viewed as the child and the company as the parent. Added to this are the countries of origination and the listed exchange(s). Financial data providers recognized the added value and the competitive edge in the ability to connect the dots, and to view the relationships between companies, types of securities, countries of origination, and listed exchanges. In the financial industry this is known as "symbology and concordance" and is key to creating an aggregated database and the golden copy. Companies like FactSet, Bloomberg, Thomson Reuters, and IHS Markit (part of S&P Global as of December 2020) have differentiated themselves with their proprietary "symbology and concordance" and their golden copy (Fig. 3.6). An aggregated database is like the golden copy. No business would want to share its secret recipe for creating the golden copy.
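As a minimal sketch of the symbology and concordance idea, the following pandas example maps vendor-specific symbols to a common internal identifier and joins them with a security master to form a small "golden copy"; all identifiers and table contents are made up.

```python
import pandas as pd

# Hypothetical concordance table: vendor symbols mapped to one internal security id
concordance = pd.DataFrame({
    "internal_id": ["SEC1", "SEC1", "SEC2"],
    "vendor": ["VendorA", "VendorB", "VendorA"],
    "vendor_symbol": ["ABC.N", "ABC US", "XYZ.L"],
})

# Hypothetical security master: company, type, exchange, and country per internal id
securities = pd.DataFrame({
    "internal_id": ["SEC1", "SEC2"],
    "company": ["Acme Corp", "Globex Plc"],
    "type": ["equity", "equity"],
    "exchange": ["NYSE", "LSE"],
    "country": ["US", "UK"],
})

# The "golden copy": vendor identifiers unified with the security master
golden = concordance.merge(securities, on="internal_id", how="left")
print(golden[["vendor", "vendor_symbol", "company", "exchange", "country"]])
```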


Fig. 3.6 The golden copy

Other details can complicate the task of creating an aggregated database. In the example from finance, mergers and acquisitions (M&A) are a common practice. As a result, the names of companies, ownerships, acquirers, and acquirees can change over time. To capture such information, we need to add another dimension to the aggregated database: a timestamp, or a point-in-time table, to record the historical relationships. It is easy to imagine the challenges in building a tree map that varies in time. It is like tracing someone's ancestry roots that change over time. Consider a study that wants to look at JPM's (JP Morgan Chase & Co) historical stock prices dating back to earlier than 2000. As indicated in Fig. 3.7, prior to the year 2000 JPM did not exist. Instead, there were many independent companies that merged over time. In deciding which pre-merger companies to choose from the tree map shown below, other company information must be factored in, such as which company was the acquirer and which was acquired.
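
As a small illustration of the point-in-time idea, the Python sketch below answers the question "which entity did an identifier refer to on a given date?" The dates and entity names are simplified for illustration and should not be read as an exact corporate history.

# Illustrative point-in-time lookup for a time-varying identifier mapping.
from bisect import bisect_right
from datetime import date

# (effective_date, entity_name), sorted by date: each row says
# "from this date onward, the identifier maps to this entity".
jpm_history = [
    (date(1990, 1, 1),  "Chemical Banking Corp."),   # simplified starting point
    (date(1996, 3, 31), "Chase Manhattan Corp."),    # merger (illustrative date)
    (date(2000, 12, 31), "JP Morgan Chase & Co."),   # merger (illustrative date)
]

def entity_as_of(history, as_of):
    """Return the entity an identifier pointed to on a given date."""
    dates = [d for d, _ in history]
    i = bisect_right(dates, as_of) - 1
    return history[i][1] if i >= 0 else None

print(entity_as_of(jpm_history, date(1998, 6, 1)))   # Chase Manhattan Corp.
print(entity_as_of(jpm_history, date(2005, 6, 1)))   # JP Morgan Chase & Co.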


Fig. 3.7 Mergers and acquisitions: a point-in-time tree map


4

Real-Time Analytics

4.1

Faster Processing

With the advent of big data, the competition for many businesses is to extract the actionable insights, and to do it fast. The speed of processing and the timing can make a big difference in a very competitive market. Many options are available to speed up the processing of data. Solutions have evolved to address two core needs: the fast retrieval of data and the fast processing of analytics. Special database appliances like Netezza (IBM) have been developed for the fast storing and fetching of data. A specialized solution like Netezza comes with its own programming and querying language to best leverage the architecture of the hardware. Such solutions come with the added cost of requiring specialized, hard-to-find, and not easily transferrable skills. Other examples are transactional databases (like Pervasive PSQL or Btrieve), best suited for transactional records such as stock tick prices, and vertical columnar databases (like Vertica or Sybase IQ), best suited for historical time series data.

In terms of computational speed, some technologies offer dedicated and/or programmable chips, such as the FPGA (Field Programmable Gate Array), which greatly speed up the calculations. FPGAs are used at some of the big online platforms such as Facebook and Baidu, and at financial firms such as the CME. Technology has come a long way since the days of programming at the assembly language level (e.g., the Hypercube) or vectorized programming (e.g., the Cray supercomputer). Although technology has evolved, some core concepts remain: the more specialized the technology, the more specialized the skills.

Other solutions include the use of parallel processing and distributed computing. Such solutions work well when the analytics and the computations can be broken into small chunks and run independently. In other words, there is no dependency, or so-called coupling, between the chunks. Otherwise, each chunk will be in a wait state until the dependent chunk(s) is/are processed. There are many examples in real life where we encounter a coupling effect, beyond analytics and computations. Social media platforms like Facebook or Instagram are typical examples. In liking a post or in responding to an ad, we are often influenced by the number of likes and the identities of some of the likers. In modeling climate change, the melting ice at the


Fig. 4.1 Coupling and interference

North Pole, the burning fires in California, and the extinction of animal species in Africa are all interconnected. Another example is travelling electromagnetic waves. Electronic gadgets placed in proximity to each other carry the risk of interference and possible signal degradation or contamination. When modeling the quality of reception and transmission for many wireless devices, we need to include the coupling effect, which depends on the proximity of the devices to each other, as demonstrated in Fig. 4.1. Solving a problem when there is coupling and interference is much harder.

Another form of coupling with big data is a looping or feedback effect. A good example is in marketing. Traditional marketing distinguished between a producer and a consumer. Such separation made sense historically. With big data, many times the consumer of the data is also the producer of the data. This creates a feedback effect. It also raises important ethical questions in modeling the behavior of a consumer, and potential biases in how AI models are implemented. As the saying goes, if the product is free, the consumer is the product (Fig. 4.2).

In the big data game, every active player is both the consumer and the producer of data. When building models, we should be concerned about the feedback loop effect, which can cause an undesired bias in the model and reinforce the status quo.
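
To make the "no coupling" case concrete, here is a minimal Python sketch that processes independent chunks in parallel; the data and the per-chunk computation are invented for illustration. When chunks depend on one another, this simple pattern breaks down because workers must wait on each other's results.

# Embarrassingly parallel processing: each chunk is independent,
# so workers never wait on each other. Illustrative example only.
from multiprocessing import Pool

def process_chunk(chunk):
    # Stand-in for real analytics run on one chunk of data.
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i:i + 100_000] for i in range(0, len(data), 100_000)]
    with Pool(processes=4) as pool:
        partials = pool.map(process_chunk, chunks)   # chunks run concurrently
    print(sum(partials))   # combine the independent partial results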


Fig. 4.2 Big data game

4.2

Analytics on the Run

The closer the analytics are to the data, the faster the analytics can be run. This realization has been around for a long time. Online Analytical Processing (OLAP) allows for the quick analysis of data residing in a SQL-like database by aggregating the data along different dimensions, like sales per region and per agent, and creating different 3-D views, the so-called cubes. Commercial products like Crystal Reports and Microsoft's offerings have integrated OLAP solutions. Other similar products include Actuate e.Report. OLAP has long been part of the business intelligence (BI) type of reporting. Proprietary in-house solutions were also developed to speed up the calculations by bringing the analytics closer to the database, allowing simple calculations, like adding, subtracting, or multiplying two columns, to be performed on tables within the database. Such capabilities are limited in scope as the format of data has become more varied and databases have evolved beyond SQL.

Instead of fetching the data first and running the analytics later, newer techniques look at running the analytics while fetching the data. This reduces the amount of data moved around and the latency to process it. This can be accomplished, for example, by developing scripted commands or requests (a script query like language) that incorporate some business logic in the request. Let us suppose we are interested in historical end-of-month (eom) data. We can include the "eom" option in the scripted command when fetching the data from the database. In another instance, we might be interested in the average of a time series. Each time the data is updated, the database can automatically compute and save a running average. This way the average is immediately available when requested. One can also incorporate columnar or vector calculations when fetching data. A columnar database is best suited to perform quick arithmetic vector calculations on the columns, like adding, subtracting, and multiplying. In this case, the operation is performed in a simple command over a large set of time series data.
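
As a small illustration of the "compute on update" idea, the Python sketch below keeps a running average current as each new value arrives, so the answer is already available when requested. It is a generic example, not tied to any particular database product, and the sample values are invented.

# Maintain a running average incrementally as data is updated,
# so the result is precomputed by the time anyone asks for it.
class RunningAverage:
    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        # Incremental mean: no need to rescan the full history.
        self.count += 1
        self.mean += (value - self.mean) / self.count

    def get(self):
        return self.mean

avg = RunningAverage()
for price in [101.2, 100.8, 102.5, 101.9]:   # e.g., streaming end-of-day prices
    avg.update(price)
print(round(avg.get(), 3))   # 101.6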


The closer to the data the calculations are performed, the faster the processing time. With streaming data, the need for quick analytics is more urgent. In Chap. 1 we discussed a clip from the 2017 movie "The Circle." Two quotes from the clip were noted: "real time analytics" and "knowing is good but knowing everything is better." In the 2002 movie "Minority Report" the lead actor (Tom Cruise) is looking at a dashboard projected on a glass screen and is manually connecting the information/data from different sources. The two movies, released at very different times, depict a reality that nowadays is less fictional, and highlight the benefits of connecting the data in real time to extract new valuable information.

The business gains/losses from fast/slow analytics can be tremendous. In my early career in finance, one urgent task was to calculate the risk numbers for different portfolio trading positions. Because of the sometimes long processing time, the customary approach was to have the calculations run programmatically overnight at a scheduled time (a batch job). The risk numbers were available first thing in the morning. Reports provided a detailed view of the portfolio risk just prior to the opening of the market. The problem was that the market can move drastically just after the opening and during the day. At times, depending on market moves, the risk numbers had to be recalculated during the day and processed quickly. The business cannot afford to wait. Every minute counts. The analytics in this case must work in near real time. To achieve this, many details needed to be resolved to speed up processing time through distributed computing of data access, analytics, and reporting. With distributed computing we are reminded again of the coupling effect presented earlier. A risk number in one type of position (e.g., international bonds) can impact the risk number in another type of position (e.g., foreign exchange).

More recently (March 2020) the world was confronted with the spread of the COVID-19 virus. Rapid testing and contact tracing to identify people at risk (those coming near someone who has been positively diagnosed) became essential to control the spread of the virus. To reach this goal, monitoring, detection, and reporting had to be done in almost real time. China was quick to implement strict measures and the technology for contact tracing, with little regard to matters of privacy. Many were willing to compromise their privacy in exchange for better health monitoring and control of the virus.

4.3

Streaming Data

In the traditional definition of business intelligence, data is stored in a database warehouse and is relatively static (updated less frequently). Reports are generated from the database following a company's internally established rules and reporting guidelines. Unlike business intelligence, business analytics assumes data is constantly changing. Old methods of conducting business do not apply anymore.


Data management and governance (see Chap. 7) work well with business intelligence. When data in a database is seldom updated, it is a common practice to audit the data, ask the what-if questions, and conduct postmortem investigations when things go wrong. Given the tools, we can trace back each step of the process and isolate possible causes of a problem. Different sources can contribute to a bad result. For example, the data might be stored incorrectly in the database due to a format conversion or a truncation while storing the data, bits might drop when data is transmitted over a network, or an automated process to update a database might terminate prematurely with no errors raised. It is also possible that the analytics or the software used in generating a report or calculating a result has a bug. It can also be an operational error due to a manually entered value. Given the many possibilities, only a careful audit to trace back in time each step performed, how, by whom, and when, can help isolate the source of the problem. Such an audit is possible when the data and the models are relatively static.

With big data, not only is the data constantly changing, but the models must also adapt as quickly as the streaming data. When the data, the analytics, and the models are constantly changing, then unless every historical granular detail is saved it is almost impossible to reconstruct the past and trace back each step of the process. Storing streaming data can be an insurmountable challenge. Years ago, the Library of Congress embarked on an ambitious project to archive all tweets since Twitter's inception in 2006. One can only imagine the vast opportunities for new research work. These tweets represent an unedited daily journal of our views, our politics, our sentiments, our emotions, our aspirations, and much more. The project ran into many technological hurdles, not to mention the logistical and contractual details of how the data should be handled, archived, and deleted. As of 2013 about 170 billion tweets had been accumulated. Eventually the project was abandoned.

An alternative and more manageable solution would be to dispose of the data, in favor of newly refreshed data, once the analytics are completed and results published. We are again confronted with the question of how to conduct an audit in case of a problem or an incorrect result. Unless we maintain a historical record of the data and the conditions (analytics, models, variables) that led to a particular result, it is almost an impossible task. Instead of the entire history, we can consider a shorter time window, enough to preserve the right data and the conditions that led to the result before they are discarded. This becomes a scaling problem whereby the time window to store records is scaled down from very large to very small.

So, what is the right time scale? The time scale will depend on the type of data, the time required to assess accuracy (veracity), and the time it takes for the data to be disseminated. Not all data are equal. For example, it has been suggested that sensational news (bad or fake) such as conspiracy theories or dramatic events tends to travel about six times faster


over the social network than regular or mundane news. In other words, if it takes about six minutes for regular news to reach a given audience size, it takes close to one minute for sensational news to reach the same audience. News stating the obvious spreads more slowly and requires less time to verify; sensational news spreads much faster and takes longer to assess for accuracy. The time scales in each case are very different. Rules for audit requirements will also differ.

Working with different time scales to accommodate different needs goes back to old legacy systems, when data was stored on magnetic tapes. To address rising needs for quick access to the data, two types of requests were accommodated separately. Quick access was granted to data up to one week old. Access to older data followed a different procedure and was granted over a longer period. The bottleneck back then was partially due to the limited technology for fast data storage and fast retrieval. Technology has greatly advanced since then. Storage capacity and speed of retrieval are less of a concern. Still, in many cases access to more recent data, versus historical data, is treated separately. Twitter, for example, allows a user to access historical tweets within a limited short time window of 3–7 days. Any request to access older tweets is handled separately and monetized accordingly. Most data providers make a clear distinction between access to recent data and access to historical data. Such a distinction is driven mainly by the technology to process small data sets versus big data sets. It is also driven by how the data is monetized. The customers for recent data are more interested in current conditions. The customers for historical data are more research oriented. The cost of the data will differ per need.
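
To illustrate the scaled-down retention window discussed above, here is a small Python sketch of a rolling time window that keeps only the most recent records needed for a potential audit. The window length and record format are arbitrary choices made for the example.

# Keep only records newer than a fixed audit window; older data is evicted.
from collections import deque
from datetime import datetime, timedelta

WINDOW = timedelta(days=7)   # retention window, e.g. one week

class RollingStore:
    def __init__(self, window=WINDOW):
        self.window = window
        self.records = deque()   # (timestamp, payload), oldest first

    def add(self, timestamp, payload):
        self.records.append((timestamp, payload))
        self._evict(now=timestamp)

    def _evict(self, now):
        # Drop everything older than the retention window.
        while self.records and now - self.records[0][0] > self.window:
            self.records.popleft()

store = RollingStore()
store.add(datetime(2021, 3, 1), {"model": "v12", "result": 0.42})
store.add(datetime(2021, 3, 10), {"model": "v13", "result": 0.47})
print(len(store.records))   # 1: the March 1 record fell outside the window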

5

Predicting the Future

5.1

A Crystal Ball

The idea of predicting the future is an ill-posed and self-contradictory proposition. If one can predict the future, then it is not the future anymore. If we can predict the outcome of a fair game, then it is no longer a fair game. As the saying goes, "the best way to predict the future is to create it," a quote attributed to many famous people, among them Abraham Lincoln. Predicting the future is like reading a crystal ball (Fig. 5.1). There are no guarantees, only possibilities. At best one can guess or predict, within a certain level of confidence, what the outcome or the future will be. In the big data competitive game and predictive analytics, it is not about being correct. It is about being sufficiently better than the rest, faster, and most importantly, being sustainable. Market conditions and documented previous findings will define what sufficiently better is.

Framing the problem correctly is critical prior to any advanced analytics and modeling. Quite often it is harder, if not impossible, to find an exact (closed-form) solution to a mathematical problem. However, it may be possible to derive an approximate or an asymptotic solution (in mathematical terms). Although mathematically it might be easier to derive an approximate solution, it is not necessarily straightforward. Such details are outside the scope of this book but are important to note. Statistically, the problem narrows to finding a solution that is sufficiently good within a certain level of confidence. An example is a business's ability to predict what a shopper will buy next, or what movie a viewer will watch next, with a 60% accuracy or success rate. To gain the competitive edge another business only needs to be better than 60%. A 70% or 80% success rate can be sufficient to beat the


Fig. 5.1 Predicting the future

competition. From a business perspective, the effort to obtain a higher accuracy rate is more time consuming and less justifiable. A small difference can be sufficient to impact user experience and decision making. Consider the common scenario where a user wants to rent a particular movie online or order a product for delivery. A difference in the time to load a movie or to deliver a product can greatly impact a customer's decision making. For example, Amazon Prime's offer of a 24-hour delivery time versus a 2–3 day waiting period had a big impact on growing its customer base.

5.2

Machine Learning and Artificial Intelligence

Machine learning and artificial intelligence are closely intertwined. Machine learning, a form of AI, has many of its roots in neural networks, Bayesian analysis, and adaptive modeling. Before machine learning there was adaptive modeling, in which the parameters of the model are programmatically adjusted and calibrated to adapt to changing input conditions. Adaptive modeling relies on a feedback loop for fine tuning, a process known as model calibration, and is applied in many areas such as finance, marketing, and supply chains. The concept has been extended to many other applications. Part of my early research work on stealth technology was the study of active surfaces, whereby characteristics of a surface such as reflectivity are adjusted depending on incoming signals. More recent examples of active surfaces are based on nanotechnology, such as wearables that adjust body temperature to the surrounding air temperature.


Fig. 5.2 Supervised learning model

Machine learning in its simple form is composed of a training data set to derive the model, and a testing data set to validate the model's predictive ability. This type of learning is commonly known as supervised learning, unlike unsupervised learning, where there is no labeled training and no testing to assess predictability. Other forms of learning include reinforcement learning, shallow learning, and deep learning. The basic concept of supervised learning is illustrated in the flow diagram in Fig. 5.2. Training data is used to derive a model, and the output from the model is compared to an observed actual output. The next step is to apply the model to a different set of data, called the testing data, to assess the goodness of the model in predicting actual values. In a traditional non-learning model, the process to decide whether we accept or reject the model ends with the testing data. With machine learning there is an additional feedback loop to algorithmically fine-tune the model and improve predictability. The model is learning both from its successful and from its failed predictions. This is all done programmatically. Consider the example of trying to algorithmically predict stock movements. The training set may consist of the end-of-day prices for a pool of stocks. The goal is to predict each stock's movement (up or down) at the next day's market open. Come the next day, the model can assess where it failed and where it succeeded in its prediction. The model adjusts itself and makes a new prediction for the day after. This iteration continues with the hope that the model will get better each time in its prediction. The model will however never reach perfection in predicting the market correctly and consistently. Otherwise, we would not have a market as we know it!

It is important to distinguish between a feedback loop in machine learning and a feedback loop in control systems, such as a thermostat that controls the temperature in a room, or an auto-cruise that controls the speed of a car. In both cases, the output is set at a desired temperature or speed level. The input to the model (i.e., the thermostat or auto-cruise) is adjusted to raise/lower the temperature in a room or increase/decrease the speed of a car until the desired output is reached. In control systems the feedback loop is there to alter the input. The model itself is not altered, and there is no prediction. In machine learning, the feedback loop is there to alter the model parameters, with the goal of improving the learning and the prediction. Depending on the conditions, machine learning can be categorized as supervised or unsupervised learning. A self-driving car can learn to drive in a supervised mode, like a teenager learning how to drive under the supervision of an adult. In unsupervised mode the car learns to drive by itself, with no supervision. Either way the basic concept is the same where the model


Fig. 5.3 Control system versus learning system

parameters are modified. Figure 5.3 illustrates the fundamental difference between a control system and a learning system. What makes machine learning exciting, and to some degree challenging, is when the input to the model is 24/7 streaming real-time data. With machine learning and streaming data, the model is constantly adapting and adjusting. As data is continuously changing, the learning never stops. Any machine learning model can only be as good as the data it learns from. Over time the learning can get better, but it never reaches perfection. Otherwise there would be nothing left to learn! Any quality issues with the data, any bias, any missing data, any purposely altered data will be reflected in how the machine learning works. At no time should the quality of the data be ignored in the rush to obtain quick results.

A potential pitfall is the so-called self-fulfilling prophecy. This was best exemplified in 2008 when Google introduced Google Flu Trends (GFT). The idea was to predict potential outbreaks of flu based on users' search patterns. Early results were very promising. Not only did GFT reconcile well with the data obtained from the CDC (Centers for Disease Control); unlike the CDC statistics, which were published on a schedule (monthly, quarterly, or annually), GFT data was published in almost real time. It quickly became popular among travelers with kids concerned about flu outbreaks at their travel destination. Results were updated in near real time on a map, easy for anyone to view and understand (Fig. 5.4).

In 2013 GFT failed miserably, missing the peak of the flu season. This was later attributed to a flaw in the model. Google was misguiding users by suggesting flu-related topics based on their searches when in fact there was no correlation. One


Fig. 5.4 Google Flu Trends

Fig. 5.5 GFT feedback loop effect

simple example would be someone searching for the topic "it feels cold." This could be in relation to severe cold weather conditions and not necessarily the flu. The flaw was in the feedback loop: Google's suggested topics were also used as input to the model, thereby overestimating the impact. This undesired feedback loop effect is illustrated in Fig. 5.5. The undesired feedback loop effect, creating a self-fulfilling reality or an unintentional bias, is among the biggest concerns in predictive analytics and AI.

IBM Watson was first introduced in the medical field to help with the diagnosis and treatment of cancer. MD Anderson Cancer Center, renowned for its cancer treatment, was among the first adopters. Doctors could, within a short time span, have IBM Watson analyze millions of research papers and records. Watson would then suggest different treatment plans prioritized per confidence level of success, as shown in Fig. 5.6.


Fig. 5.6 IBM Watson for cancer treatment

Doctors presented with such options may opt for a treatment suggested by Watson and avoid a new clinical trial with not enough history for Watson to analyze. As more doctors follow suit, we risk creating a self-fulfilling prophecy. Allowing AI to be in the driver's seat can have devastating repercussions for new discoveries, creativity, and innovation. The risk is not AI and robots assuming supremacy. It is rather us making ourselves obsolete. The choice is ours!
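
To tie this section together, here is a minimal Python sketch of the training/testing/feedback flow of Fig. 5.2, applied to the earlier stock-movement example. It uses synthetic random returns and the scikit-learn library, neither of which is referenced in the book; it is an illustrative sketch, not the author's method.

# Supervised learning with a feedback loop, on synthetic data (illustration only).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
returns = rng.normal(0, 0.01, size=500)       # fake daily returns
X = returns[:-1].reshape(-1, 1)               # feature: yesterday's return
y = (returns[1:] > 0).astype(int)             # label: did the price go up today?

split = 400
model = LogisticRegression().fit(X[:split], y[:split])        # training data
print("test accuracy:", model.score(X[split:], y[split:]))    # testing data

# Feedback loop: as each new day arrives, refit on all data seen so far.
for day in range(split, len(y)):
    model.fit(X[:day + 1], y[:day + 1])   # the model parameters are updated

On purely random data the accuracy hovers around chance, which mirrors the point above: the model keeps learning but never reaches perfection.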

5.3

Smart Reporting and Actionable Insights

Any analytics, no matter how advanced, has little value if not associated with clear actionable insights. Many tools are available to generate fancy plots, diagrams, and animated visualizations. These tools are available for free in the public domain or can be obtained commercially. What the tools fail to reveal is what the plots are telling us, and what actionable insights can be drawn. Earlier in the book we defined intelligent information as the actionable insight or the knowledge to act.


Smart reporting is about demonstrating actionable insights, emphasizing differentiating factors, and highlighting potential causation. For every report there ought to be a good story and clear actions. The following sections contain some examples highlighting differences between smart reporting and regular reporting.

Data Context

Reporting can easily be tailored to serve an agenda or to misguide an audience, or it can lack any context and serve little purpose. Data context is a crucial detail in any smart reporting. A classic example (Figs. 5.7, 5.8, and 5.9) looks at New England traffic fatality rates (deaths per 100,000) in five regions for the period 1951–1959. From the plot one can conclude that, because of the added police enforcement, the number of traffic deaths has gone down. But is this decrease a real causation or a simple association? A closer look based on a longer time scale (Fig. 5.8) seems to suggest that a similar reduction in traffic deaths was observed multiple times over a long period of time. Another look (Fig. 5.9), incorporating nearby geographical areas, clearly demonstrates the behavior is not unique to Connecticut. In other words, there is no indication to suggest the observed behavior is unique to Connecticut and caused by stricter enforcement of the law.

[Figure 5.7 chart title: Connecticut Traffic Deaths, Before (1955) and After (1956) Stricter Enforcement by the Police Against Cars Exceeding Speed Limit]

Fig. 5.7 Data context (Ref: Donald T. Campbell and H. Laurence Ross. “The Connecticut Crackdown on Speeding: Time-Series Data in Quasi-Experimental Analysis,” Law & Society Review Vol. 3, No. 1 (Aug. 1968), pp. 33–54.)


Fig. 5.8 Adding context

Fig. 5.9 More context


Units, Scales, Legends, Labels, Titles, and References

Other important topics in smart reporting, and must-haves in every chart, are units and scales. Figure 5.10 lists some of the must-haves. In 1999, a report published by NASA's Jet Propulsion Lab (JPL) on the loss of the Mars Climate Orbiter attributed the cause to a simple unit error. Findings in the report indicated that one engineering team, from Lockheed Martin Astronautics, was communicating its results in English units of measurement (e.g., inches, feet, and pounds), while another team, from JPL, was using metric units (e.g., centimeters, meters, and kilograms). As a result, the calculated measurements were highly exaggerated, which brought the orbiter too close to the planet and caused a loss of communication with the orbiter. No one could have guessed that such a simple unit error would result in such a big loss.

Figure 5.11 is an example where a poor choice of scaling may not reveal important data behavior. In the first plot, it is hard to discern any odd behavior. The better choice of scale shown in the second plot reveals a potentially odd behavior.
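
As a small illustration of these must-haves, here is a hedged Python sketch (using matplotlib, with made-up sensor readings) that adds a title, labeled axes with units, a legend, and a y-axis range chosen to reveal rather than hide the variation in the data.

# Chart hygiene: title, labeled axes with units, legend, and a y-range
# tight enough to reveal the variation. The data is invented for illustration.
import matplotlib.pyplot as plt

hours = list(range(13))                          # elapsed time in hours
temps = [86.2, 86.5, 87.1, 88.0, 89.4, 90.2, 89.8,
         88.9, 88.1, 87.5, 87.0, 86.7, 86.4]     # readings in degrees F

fig, ax = plt.subplots()
ax.plot(hours, temps, marker="o", label="Sensor A")
ax.set_title("Server room temperature over 12 hours")
ax.set_xlabel("Elapsed time (hours)")
ax.set_ylabel("Temperature (degrees F)")
ax.set_ylim(84, 92)    # a 0-100 range would flatten the curve and hide the spike
ax.legend()
plt.show()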

Data Presentation

The proper selection of a chart type can help establish possible causality and identify the actionable insights. The opposite is also true: a poorly selected chart type may lead to bad decisions. In Fig. 5.12 the pie chart on the left is a poor choice since all slices are almost comparable in size. It would be hard for an audience to visually distinguish the differences in size in a short time. The bar chart on the right makes the relative sizes and the ranking clear.

A great case study (Figs. 5.13, 5.14, and 5.15) on the visual presentation of data is found in Edward Tufte's book "Visual Explanations" (www.edwardtufte.com). It is about the space shuttle Challenger explosion in 1986, which resulted in the death of the entire crew on board. An aftermath investigation was launched to identify the cause of the crash. This was followed by a congressional oversight hearing. The investigation revealed that the cause of the crash was the failure of the rubber O-rings under cold temperatures. Findings of the investigation were part of a published congressional report (www.govinfo.gov). So why was this detail missed prior to the launch? The diagram in Fig. 5.14 was an attempt by engineers, prior to the launch, to argue against launching the shuttle at close to freezing temperatures, which could adversely impact the integrity of the O-rings in the booster rockets. The temperature value was embedded in each rocket drawing, and in no particular order. Although the engineers had a valid argument, it was hard from the diagram to draw a clear conclusion and to make a convincing case. As a result, the Challenger was launched the next day at close to freezing temperatures, ending in a disastrous tragedy. By rearranging the same data differently, as presented in Tufte's book and shown in Fig. 5.15, the possible cause–effect between temperature and damage to the

Fig. 5.10 Units and labels


Fig. 5.11 Proper scaling (left panel: "Bad choice of scale"; right panel: "Better choice of scale")

Fig. 5.12 Choice of chart type

Fig. 5.13 The Challenger O-rings


Fig. 5.14 History of O-ring damage

Fig. 5.15 Re-arranged visual display (# Edward R. Tufte, Visual Explanations, p. 45)

O-rings becomes clearer. By highlighting the expected temperature on launch day, it becomes obvious what actions should have been taken.

A good selection of chart type can help demonstrate possible causality and actionable insights. The opposite is also true: a poorly selected chart type can lead to bad decisions, sometimes with tragic consequences.


Fig. 5.16 Classical view

This last example is from my own experience in fintech and big data. It was part of a 5-minute elevator pitch to introduce our startup (SMA) to a large business prospect. The first illustration, shown in Fig. 5.16, is a tree map where the size of a box is indicative of the tweet volume for the stock. The color intensity of a box represents the total sentiment score derived from the tweets, with bright red being very negative (bearish) and bright green very positive (bullish). In the classical view, it is not surprising that the dominant stocks are companies like Apple, JP Morgan, Google, Amazon, and Microsoft. This is because, on any given day, they are among the most talked-about stocks on social media. There is little added intelligent information and actionable insight from a trader's perspective. It is a statement of the obvious.

The next picture, shown in Fig. 5.17, looks quite different. It is derived from the same data and on the same day. The size of a box is now indicative of the volume change from the norm. Equally, the color intensity is a measure of the sentiment deviation from the norm. The norm is unique to each stock. Companies like Abercrombie ("ANF") now appear dominant, while trendy stocks like Apple and Google are deemphasized. The contrasting smart view highlighted the unusual effects and the actionable insights, enough to win the client's buy-in.
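
The "deviation from the norm" view can be approximated with a simple standardization step. The Python sketch below scores each stock by how far today's tweet volume sits from its own historical average, so a quiet name with an unusual spike outranks a perennially noisy one. The volumes are invented for illustration and do not reflect the actual SMA data.

# Score each stock by deviation from its own norm (a z-score), not by raw volume.
import statistics

history = {                      # past daily tweet volumes (illustrative)
    "AAPL": [9500, 10200, 9800, 10100, 9900],
    "ANF":  [120, 95, 110, 100, 105],
}
today = {"AAPL": 10300, "ANF": 480}   # today's volumes (illustrative)

def deviation_from_norm(past, value):
    mean = statistics.mean(past)
    stdev = statistics.stdev(past) or 1.0   # guard against a zero spread
    return (value - mean) / stdev

scores = {s: round(deviation_from_norm(history[s], today[s]), 1) for s in history}
print(scores)   # ANF's modest absolute volume is a huge deviation from its norm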


Fig. 5.17 Smart view

5.4

Codeless Coding and Visual Modeling

With smart reporting comes a new trend. Computer programming is undergoing a major transformation. Coding will one day become the task of AI and robotics. The shift is more towards visual modeling, flowcharts, and integrated processes. In the late 1980s the NeXT computer made its debut and quickly became the sought-after computer to work on. The NeXT environment provided powerful development tools, among which was a graphical user interface builder called the NeXT Interface Builder (NIB). NIB allowed for quick prototyping of a model with a simple drag and drop of palettes or objects. Behind the scenes a template (skeleton) of the code was generated (in the object-oriented language Objective-C). The code template could later be manually completed or enhanced to add features. The world wide web, now almost a household item, was first developed at CERN using a NeXT computer.

Since then, many more advanced tools have been developed based on similar concepts. RapidMiner, with both free and commercial versions, is a visual modeling tool ideal for quick prototyping. With RapidMiner, building an application that connects to streaming unstructured data from Twitter and analyzes the content using NLP (Natural Language Processing) becomes a rather simple task of selecting pre-packaged modules and connecting them together. It is like building a remote-controlled toy car or a robot using Lego blocks. The advent of the Raspberry Pi (a credit-card-sized single-board computer at an affordable low price) enables almost anyone to


Fig. 5.18 MC simulation code in R

build electronic devices, remotely controlled by a mobile phone, with a minimal upfront learning curve.

With AI it is more urgent to acquire skills in design thinking and visual modeling. The actual implementation of a model, such as coding, will increasingly become the task of AI. Generalists have become more in demand than experts specialized in one narrow field. Some of the skills in visual modeling are:

– The ability to identify the various conditions, constraints, components, and variables of a problem, i.e. inputs and outputs.
– The ability to arrange all components in a process flow diagram and make the proper connections.

Before writing any line of code, one of the basic steps in computer science and engineering is to develop a flowchart diagram. A flowchart diagram defines the different states, logical decisions, inputs, and outputs. This exercise can be compared to deriving an entity relationship diagram (ERD) when designing a database. The core idea in both cases is to draft a blueprint, to translate a thought process into a visual representation, a logical model, and a flow diagram. A person well versed in coding is not necessarily someone with good design skills. With the advent of machine learning and AI much of the coding can be automated. However, it is much more difficult to automate a thought process and design thinking.

To illustrate this point, let us consider the following example. Figure 5.18 shows standard code to run a Monte Carlo (MC) simulation for pricing options, such as stock or equity options. MC simulation can be applied to many circumstances other than finance. The code example in this case is written in R, but the coding language itself is a detail not relevant for our scope. By analyzing the MC code, we can derive a corresponding flowchart diagram. Both analysis and synthesis are essential skills in the age of AI. Figure 5.19 is a


sample flowchart diagram corresponding to the MC code. It was part of a graduate class exercise I taught. Such an exercise can be conducted as part of a class or a training session and does not necessarily require skills in programming. The reverse translation of a code into a flowchart diagram requires skills beyond just coding. Many details are incorporated at each level that make it easier to comprehend the process and demonstrate proficiency in the topic. It is a mix of skills in analyzing and synthesizing. In the age of artificial intelligence and machine learning, we should be promoting skills in critical and design thinking, creativity, visualization, process orientation, problem solving, and the ability to see the big, interconnected picture. Soon, writing code will be a less necessary skill and will be replaced with visual modeling skills. AI will translate the visual model into executable code. A hypothetical scenario is represented in Fig. 5.20.
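
For readers who want a feel for the kind of code behind Fig. 5.18, here is a minimal Monte Carlo option-pricing sketch in Python rather than R; the parameters are arbitrary illustrative values, and the sketch is not a reproduction of the figure's code. It simulates terminal stock prices under geometric Brownian motion and averages the discounted payoffs of a European call.

# Monte Carlo pricing of a European call option (illustrative parameters).
import numpy as np

S0, K = 100.0, 105.0            # spot price and strike
r, sigma, T = 0.01, 0.2, 1.0    # risk-free rate, volatility, time to maturity (years)
n_paths = 100_000

rng = np.random.default_rng(42)
z = rng.standard_normal(n_paths)
# Terminal prices under geometric Brownian motion.
ST = S0 * np.exp((r - 0.5 * sigma**2) * T + sigma * np.sqrt(T) * z)
payoff = np.maximum(ST - K, 0.0)          # call option payoff at maturity
price = np.exp(-r * T) * payoff.mean()    # discounted average payoff
print(round(price, 2))

Translating this dozen lines into the flowchart of Fig. 5.19 (inputs, random draws, payoff calculation, discounting, output) is exactly the analysis-and-synthesis exercise described above.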


Fig. 5.19 MC flowchart


Fig. 5.20 Codeless coding


6

The New Company

All lines of business, including academia, are impacted by AI and automation. In the age of AI, education and training will be less about what knowledge we have, and more about what we can do with that knowledge. It is not a competition of man versus machine. It is rather a coexistence of man and machine (Fig. 6.1). Although there is less room for mediocrity, there is ample room for intuition, critical thinking, and creativity. The best way to learn is not by memorizing, reciting, or solving problems via pattern recognition or association to similar problems. Such methods can be easily automated. We need to constantly ask ourselves: what is it that we know or offer that Google, or a machine, cannot? What are our differentiating factors? A good way to learn and to be creative is by experimenting, exploring, and playing with different ideas, as the play expert and futurist Yesim Kunter advocates with "Play to Innovate®." In 2017, the Economist dedicated one of its covers to the topic of lifelong learning and how to survive in the age of automation. It is a constant effort to adapt, renew, learn, and stay in tune with evolving market needs.

In the age of automation and robots, there is little room for mediocrity, and ample room for intuition, critical thinking, and creativity. It is less about the knowledge we have, and more about what we can do with that knowledge. It is less about competing with robots and more about how to coexist for a better shared future.

6.1

The Mythical Profile

In 1975, a book published by Fred Brooks titled "The Mythical Man-Month" became very popular among software engineers and project managers. One of the central themes in the book is resource budgeting. Brooks mocks the notion that


Fig. 6.1 Man and machine coexistence

if it takes nine man-months to complete a project, we can speed up the process and complete the project in a month by assigning nine people to it. This was common in software development, where a software developer was ranked by how many lines of code he/she could write in a given time. The prevailing perception in industry at that time was that we could reduce the time needed to generate a given number of lines of code by assigning more people to the project. Such a simple metric to measure performance was a common practice. Even today, the desire to identify the superhero and attain a rushed solution to a problem is still prevalent.

The title "data scientist" is the new trendy buzz in the industry. It appears in many online job ads and in many curricula vitae (CVs). In the early 1990s, many (including me) transitioned from working at research labs to Wall Street. We were back then called rocket scientists or quants. So, what exactly are data science and a data scientist? Data science can best be described as the intersection of three fields: computer science, math and statistics, and business domain expertise (Fig. 6.2). Each of the three intersecting fields provides valuable skills. There is clearly room for different skills and for many to participate in data science, including the art of visualization and communication. The term STEM, which stands for Science, Technology, Engineering and Math, is incomplete without Art. The more comprehensive acronym is STEAM. Stemming from the above definition of what data science is, we can describe the profile of a data scientist as the intersection of many skills (Fig. 6.3). In practice, someone who excels in all four areas (mathematics, programming, domain knowledge, and communication) tends to be more of a unicorn. Perhaps not surprisingly, domain expertise and soft skills are harder to find and more in demand in the industry.


Fig. 6.2 Defining data science

Fig. 6.3 The modern data scientist

6.2

Organizational Structure

Corporate organizational structures, logistics, red tape, and compartmentalization inhibit efficiency and the ability to see the big picture. The new enterprise is more adaptive and dynamic, able to quickly respond to changing needs. Organizational


structures tend to differ from centralized to decentralized, and from matrix to functional. In data science and business analytics, there are a few important considerations when deciding which organizational structure is best. Data science is a multidisciplinary field. Tackling problems in data science is best conducted through a well-integrated team effort. In a matrix-like organizational structure, having a separate team dedicated to data science within each of the different Lines of Business (LOBs) can be difficult to set up, an unnecessary duplication of effort, and inefficient. A more functional approach is to create a central unit, independent of the lines of business, dedicated to data science and analytics. Skills in mathematics, programming, databases, and communication are transferable and can be assigned to a different LOB. The one skill not easily found or transferred is domain knowledge. Domain knowledge can reside within the LOB. When a new project in data science needs to be tackled within a particular LOB, a team from the central unit can be deployed within the LOB for the duration of the project and work closely with the resident business domain expert. A project leader from the LOB would be assigned to manage progress and deliverables for the duration of the project.

6.3

Software and Technology

Over-reliance on specific tools will alter our perception of the possible solutions and limit a company's agility. It is important to recognize what each tool's capabilities are, their strengths and weaknesses, and how to best leverage them. As the saying goes, "if your only tool is a hammer, then every problem looks like a nail." For example, R and Python are best for statistical and data analysis and for working with very large data sets. Both are freely available in the public domain (open source) and are nowadays the most popular programming languages in data science. Similar commercial tools are SAS, SPSS, and Stata. Other tools, like QlikView and Tableau, are best tailored for visual analytics, with drag-and-drop ease of manipulation, connectivity to different data sources, and free public sharing of interactive presentations. IBM Watson Analytics is best known for its cognitive analytics and speed of processing. RapidMiner and KNIME (free and commercial versions) provide a workflow interface to build applications and are best for visual design and quick prototyping. The types of tools needed vary depending on the nature of the problem to solve and where the problem resides in the end-to-end process flow. What tools to use, when to use them, and for what purpose are some of the key questions to ask when solving a problem.


Fig. 6.4 Multi-stage tools

Fig. 6.5 Sentiment analysis process flow

There are three main stages to consider in a typical process flow. Some tools may allow for some cross-stage capabilities but are best recognized for their core capabilities (Fig. 6.4). Figure 6.5 depicts a sentiment analysis process flow using feeds from Twitter. In the context of this book, software should be treated as a tool to solve a problem and not as the end itself. Solutions vary between the commercially available and the free open source. In the age of big data and rapid technology evolution, there is urgency for companies to be agile and rapid in developing solutions. Open source code can be adapted more quickly, and for this reason its adoption in the industry has grown rapidly. There are pros and cons. Adoption of open source solutions carries some risks (e.g., support and maintenance) and sometimes requires specialized skills. However, it allows for rapid prototyping and a quicker time to delivery. A good analogy for an open source solution is playing with Lego shapes and modules. There are no limits on the possibilities and how far the imagination can go, from the simplest Lego model to the most advanced (Fig. 6.6). A comparison of open source solutions versus commercial ones is best summarized by the penguins in Fig. 6.7.
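
As a toy illustration of the collect, analyze, and report stages of Figs. 6.4 and 6.5, the Python sketch below scores a handful of hard-coded messages with a tiny word list instead of a live Twitter feed and a real NLP library. Every name, message, and word in it is invented for illustration.

# Toy three-stage flow: collect -> analyze (sentiment) -> report.
POSITIVE = {"great", "up", "bullish", "beat"}
NEGATIVE = {"bad", "down", "bearish", "miss"}

def collect():
    # Stand-in for a streaming source such as a Twitter feed.
    return ["AAPL earnings beat, stock up", "ANF outlook bad, very bearish"]

def analyze(messages):
    scores = []
    for msg in messages:
        words = set(msg.lower().replace(",", "").split())
        scores.append(len(words & POSITIVE) - len(words & NEGATIVE))
    return scores

def report(messages, scores):
    for msg, score in zip(messages, scores):
        label = "bullish" if score > 0 else "bearish" if score < 0 else "neutral"
        print(f"{label:8s} ({score:+d})  {msg}")

msgs = collect()
report(msgs, analyze(msgs))

In a workflow tool like RapidMiner or KNIME, each of these three functions would be a pre-packaged module connected visually rather than written by hand.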


Fig. 6.6 Lego like modular solutions

Fig. 6.7 Open source solutions versus commercial

The choice of tools varies between the free open source and the commercially available. The former allows for freedom of choice and quick adaptation, relying on internal skills for support and maintenance. Commercial solutions come with support and maintenance, at the expense of less adaptability, less freedom of choice, and slower agility.

7

Data Ethics: Facts and Fiction

There was a time when companies like Microsoft promoted and differentiated their offerings with the slogan "what you see is what you get." A lot has changed since then. It is fair to assume that we live more and more in a world where "what you see is less than what you get." The separation of fact and fiction is much more blurred and harder to draw. Thanks to advances in AI technology, like deep learning and deep fakes, it is possible to create almost any new [fake] reality. Figure 7.1 shows a photo generated completely by AI. The photo is created from scratch and does not refer to a real person. It is almost impossible to tell whether the photo is real or fake. Other examples include face swapping of real persons in pictures or in videos. One can imagine the unlimited possibilities with deep fakes, the undesired implications, and the risks. It is not uncommon for 99% of the data published on social media or similar channels to be irrelevant, manipulated, or simply garbage. Still, 1% of big data is a sizable amount of relevant data. But which 1%?

7.1

Virtual or Fake Reality

Virtual reality, augmented reality, deep learning, and deep fakes are dramatically altering the world we live in. Discerning between fiction and reality, between intuition and rationale, between the relevant and the irrelevant, at a time when decisions are to be made almost instantaneously, is a major challenge and an almost impossible task. Fake news, alternative facts, misleading news, altered news, and fact checkers are now commonly used terms. Below are a few examples to help illustrate the differences. Terms like facts, alternative facts, and fact checking became popular during the recent US presidential debates in 2016 and 2020. Figure 7.2 provides a simple illustration. The left picture can be described as a splash from a can, while the right image implies a hand with five fingers.


Fig. 7.1 Fact or fiction (https://generated.photos)

Fig. 7.2 Alternative facts

An example of misleading news is a photo posted in 2017 by a fake account making a false claim, suggesting a Muslim woman's indifference to a terror attack in Westminster (UK), to promote far-right ideas. Although the photo was real, it was later explained by the man who took the picture that the woman was traumatized and avoiding looking at the horror surrounding her. Another example is news out of context. In 2016, following NFL player Colin Kaepernick's kneeling during the national anthem to protest oppression and


Fig. 7.3 Socrates’ triple filter test

injustice, kneeling became a symbol of protest. A photo published by FOX News incorrectly portrayed an NFL player kneeling in protest when in fact he was praying prior to the game. FOX later apologized for the misleading photo. Altered news is another example. On September 11, 2001, President Bush was informed of the tragic 9/11 terrorist attack while on a visit to an elementary school in Florida. A picture circulating on social media depicts the President reading a book upside down. It was later revealed that the picture had clearly been altered.

So how do we assess the authenticity of information before sharing it? A good recipe can be found in a story and life lesson attributed to the Greek philosopher Socrates. Even though Socrates lived around 400 BC, the recipe is still valid nowadays. It is called the triple filter test, portrayed in Fig. 7.3. The story goes like this: one day a disciple of Socrates came running, agitated, with important news to share. Before the disciple was able to share the news, Socrates asked him to calm down and answer three questions. The first question was "Are you absolutely sure that what you're going to tell me is true?" The disciple could not be a hundred percent sure, so he could not say yes. Second question: "Is what you're going to tell me good or not?" In fact, what the disciple was about to share was sad and going to cause distress. Third question: "Is what you have to say going to help me?" The disciple did not know if the information was useful or not. In the end Socrates replied, "If what you want to tell me isn't true, isn't good, and isn't even useful, why would I want to hear it?"


Fig. 7.4 Spotting fake news

The moral of Socrates' triple filter test, applicable nowadays, is for us to be less passive transmitters of information and more active assessors of information before sharing it. The infographic below (Fig. 7.4), published by the International Federation of Library Associations (IFLA), provides a good recipe to spot fake news before sharing. It is another example of steps each one of us can take to help reduce or slow down the propagation of fake news.


Confronted with the new challenge of discerning truth from fake, it becomes a shared responsibility, and an imperative on each person, to assume their part in reducing the propagation and amplification of fake news and information. Otherwise, we all pay the price.

7.2

Privacy Matters

With the advent of 24/7 real-time streaming data, issues of privacy present serious new concerns and challenges. At a time when technology is advancing at a rapid pace and there is no shortage of data, attention to questions of privacy and ethics has been sorely lacking. The main concern is that many of these questions remain unanswered while technology leaps fast into the future. When we say data ethics, we also think of data privacy. Although the two are intertwined, there are big differences. Data privacy is concerned with the protection of, and the legal implications of exposing, personally identifiable information (PII). Data ethics is more about the proper handling of data to preserve and promote a responsible and better engagement with our surroundings, from people to the environment. Although the fundamentals of ethics have not changed, we have new ethical problems associated with data.

When addressing matters of data privacy there are two competing arguments. Consider an example having to do with customer relations. We are often frustrated by dealing with companies that do not "know" us as individuals, and that do not remember anything about the last interaction we had with them. We tire quickly of having to repeat ourselves. On the other hand, we are particularly leery of companies asking us for personal information before we have decided that we want to do business with them at all: "can I have your phone number and zip code?" That is like asking for your birth date and income before applying for a mortgage! Another example has to do with healthcare information. In an emergency you want the doctor to be able to quickly access your hospital records and obtain the critical information they most need to know. On the other hand, a pharmaceutical company that studies the adverse reactions to a drug should not compromise the privacy of patients by requesting or accessing unnecessary private information.

There are two competing arguments to data privacy. Many times, we may need to trade privacy for better and more tailored service.

One way to balance privacy and the need for information is through "anonymized data." Anonymization is stripping personally identifiable information (PII) from the data, allowing the individuals associated with that data to remain anonymous.
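
As a small illustration of anonymization along the lines of the patient table in Fig. 7.5, the Python sketch below drops direct identifiers and replaces the record key with a salted hash. The field names, the record, and the salt are invented, and real de-identification (for example under HIPAA or GDPR) requires far more than this toy step.

# Toy anonymization: drop direct identifiers, pseudonymize the record key.
import hashlib

SALT = "replace-with-a-secret-salt"          # illustrative only
DIRECT_IDENTIFIERS = {"name", "ssn", "email", "address", "phone"}

def anonymize(record, id_field="patient_id"):
    pseudo = hashlib.sha256((SALT + str(record[id_field])).encode()).hexdigest()[:12]
    cleaned = {k: v for k, v in record.items()
               if k not in DIRECT_IDENTIFIERS and k != id_field}
    return {"pseudo_id": pseudo, **cleaned}

patient = {"patient_id": 4711, "name": "Jane Doe", "ssn": "123-45-6789",
           "address": "12 Elm St", "age": 54, "diagnosis": "type 2 diabetes"}
print(anonymize(patient))
# {'pseudo_id': '...', 'age': 54, 'diagnosis': 'type 2 diabetes'}

Note the trade-off described above: the remaining fields (age, diagnosis) still carry analytical utility, yet combined with other data sources they could contribute to re-identification, which is exactly the profiling concern discussed next.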


Fig. 7.5 Patient anonymized table

Some of the types of data that can be considered private, and should be subject to special handling, are:

– Personally identifiable information such as name, social security number (SSN), driver's license (DL), email, telephone
– Sensitive information such as race/ethnicity, political views, religious beliefs
– Other information such as cookie ID, static IP address.

A simple example of data anonymization is shown in Fig. 7.5. The address can be a private detail as it may allow the extraction of personal information. In general, the more the data is anonymized, the less utility it has. So, what is the purpose of anonymization? Many times, it is to fulfill a requirement and satisfy legal terms. Although anonymization makes it harder to guess an identity, it offers no full protection. With enough effort and access to multiple sources of data, it is possible to connect the dots and trace the data back to a particular user. This is known as profiling, and it is one of the most concerning and contested privacy topics. The European GDPR (General Data Protection Regulation), which went into effect in May 2018, clearly sets rules against profiling. What exactly is profiling? Profiling is intimately related to the concept presented earlier in this book on connecting the dots. It is when the whole is greater than the sum of its parts, or put simply 1 + 1 > 2. By connecting the dots, we can build a more comprehensive picture of a person's profile, including private information such as race, ethnicity, gender, political affiliation, religious belief, etc.

There are no universally agreed upon rules on how to protect privacy. Most companies tend to develop and implement their own privacy rules. There are, however, some basic principles most companies must adhere to, outlined in many documented accords such as the US-EU Safe Harbor on


safeguarding personal information. The European GDPR is to date the most serious attempt to put in place rules for the protection of private data, and the most widely accepted. So, what are some of the main activities associated with data privacy that every business dealing with data should conform to, or seriously consider, for better business conduct and ethics?

– Transparency: Inform individuals of the purposes for which the collected information will be used.
– Choice: Offer the opportunity to opt out.
– Consent: Only disclose personal data to third parties consistent with the above and subject to established rules.
– Security: Take reasonable measures to protect personal information from loss, misuse, unauthorized access, disclosure, alteration, and destruction.
– Data Integrity: Assure the reliability of personal information for its intended use with reasonable precautions, and ensure the information is accurate, complete, and current.
– Access: Provide individuals with access to the personal information held about them.
– Accountability: The firm must be accountable for following the rules and assuring compliance.

With increasing data breaches, we should expect to see more insurance policies to compensate users in case of a compromise of private information.

7.3 Data Governance and Audit

Most institutions, private, public, for-profit, and not-for-profit, [should] have rules in place governing how data is handled, an oversight group to ensure compliance, and internal audits. Data sourced from commercial vendors is subject to a licensing agreement defining the terms and conditions on how the data can be used within an organization. Data governance, sometimes referred to as data management, sets the rules on the data's who, what, where, when, how long, and why. Per industry standards, some of the main questions belonging to data management are:

– Retain: What data can be stored internally and for how long? What is the expiry date and aging policy attached to the data? The expiry date is the length of time for which data can be stored at the client site. The aging policy is about how often the entire data set, or a section of the data, is refreshed. For example, we may obtain a data set that expires in two years and where the data is renewed every three months. In some cases, data can be used to derive insights from, but the data itself cannot be stored internally. And if stored, it may be for a short period of time, like a two-week rolling time window: at any given time, only the two most recent weeks of data are stored, and any earlier data is deleted. Such details are part of a licensing agreement, and if broken they can have serious legal ramifications.
– Archive: When is the data deleted or archived? Such details are part of licensing agreements and internal firm regulations. Data may also have to be deleted due to storage constraints. The archiving of data is many times driven by both storage limitations and performance improvement. As data gets old and is used less, it is more efficient to maintain a limited data set on a rolling basis. The reason is that it is easier and faster to search for a needle in a smaller haystack than in a big one.
– Share: How can the data be disseminated within an organization? Sharing is more about what data can be made visible or public within an organization and with whom. It is not about access privileges.
– Merge: What data can be aggregated or connected? This is many times banned by the prime data providers. It can be viewed as an attempt to profile, or to create a higher data value proposition which would be in competition and a conflict of interest.
– Access: Who can access what data? Another important detail in data management is authorization and credentials. The former is about authenticating a user's access to a data repository or a database. Once a person is authenticated, the next step is to assess the user's credentials to decide what data they may access. For example, a user working at a financial institution may have access only to a selected set of portfolios, but not others. Most of the time, not everyone has access to all data within an organization. Some of the rules might be set by the provider of the data and some by the organization itself. Deciding who has access to what, and at what level of data granularity, is done programmatically by setting rules for credentials and account privileges for individuals and group members. Imagine, for example, we have identified two types of users and two types of data categories. This leads to four (2 × 2) different combinations of credentials and privileges. The number of combinations can easily grow depending on the different levels of users' credentials and data privileges. It is not uncommon to have 100 (10 × 10) combinations. Programmatically this is best implemented using bitwise logic (a minimal sketch follows this list).
– React: Who explicitly manages the data? What can be edited or corrected, and by whom? Not everyone within a firm may have the right to manage the data, and not all data can be modified. By managing we mean ensuring the data is handled in ways that best serve the needs of a firm, while complying with the licensing agreement and other internal rules set by the firm. The other question is what data can be acted upon or modified, such as through correction or editing. Many times, such acts are not allowed, and if they are, the information or the result of the act should be shared with the provider. This is to ensure the integrity of the data is maintained across all consumers. To maintain the integrity of the data as it is moved from a source to a destination, it is paramount to have rules in place that define what data can be edited, modified, or corrected and by whom. These rules serve, in case of a data quality breach, to investigate the source of the problem and to assign responsibility when needed.
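As a minimal illustration of the bitwise logic mentioned in the Access item above, the sketch below encodes hypothetical data-category flags and user credentials as bits; the flag names and the rule that access requires every required bit are assumptions made for this example.

# Hypothetical bit flags for data categories a user may be entitled to see.
PUBLIC     = 0b0001
INTERNAL   = 0b0010
PII        = 0b0100
FINANCIALS = 0b1000

# Each user's credential is the bitwise OR of the categories granted.
analyst = PUBLIC | INTERNAL               # 0b0011
auditor = PUBLIC | INTERNAL | FINANCIALS  # 0b1011

def can_access(user_flags: int, required_flags: int) -> bool:
    """Grant access only if the user holds every bit the data set requires."""
    return (user_flags & required_flags) == required_flags

print(can_access(analyst, INTERNAL))             # True
print(can_access(analyst, PII | INTERNAL))       # False: missing the PII bit
print(can_access(auditor, FINANCIALS | PUBLIC))  # True

A single integer per user and per data set can thus represent a large matrix of credential and privilege combinations, which is why the bitwise approach scales well.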


Audit, in general, is the ability to trace back and replay events that occurred in the past. The historical time span can vary with the circumstances. In the USA, the IRS (Internal Revenue Service) rules for auditing tax filings can range from 3 to 6 years back. Events can be in relation to financial reporting, decision making, data modeling, data processing, etc. Audit assumes there are records in place (mainly in digital form) that capture the what, the when, the how, and the who. It is like having a snapshot frozen in time. Our interest here is in the data and derived analytics. In many institutions, especially financial ones, model development falls under strict oversight rules set by an internal model governance group. Any change to a model is subject to strict internal rules and guidelines. Part of the guidelines, and to meet audit requirements, is the ability to reproduce the same modeling conditions (i.e., data, model parameters, assumptions made) that led to results at a certain time in the past.

With 24/7 streaming data, auditing as we know it must be redefined. Under AI and machine learning both the data and the model are constantly changing and adapting. Recreating the past, for auditing purposes, means the ability to reproduce every instantaneous moment of the past. Almost an impossible task! What data and what model were used to produce a particular outcome are not easily reproducible with streaming data. Let us consider Twitter data. A business subscriber to Twitter data can store all the tweets used internally to run the analytics and derive the actionable insights. The stored tweets are subject to a usage licensing agreement. They can be used internally but not distributed. The business can only monetize and distribute their analytics, like sentiment scores. For auditing purposes, a model that incorporates machine learning will need to reprocess the data and reproduce the same sentiment scores as in the past. This is feasible, albeit not easy, if all the data and the conditions for the model can be recreated. If, on the other hand, the data is disposed of once the analytics are created, it is almost impossible to revert in time.

Depending on the nature of the business, the rules for auditing can vary. If the analytics can seriously jeopardize a business by putting millions of dollars at risk (true in the financial industry), having strong oversight rules and internal auditing capabilities is a must. Assume now that auditing has determined the model was not correct. Any correction to the model needs to also reproduce all historical results that were already published.
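One possible way to support the reproducibility requirement described above, offered here as an assumption rather than an industry-standard recipe, is to persist an audit record that captures a fingerprint of the input data together with the model parameters and outputs at the time they were produced.

import hashlib
import json
from datetime import datetime, timezone

def audit_snapshot(data_rows: list, model_params: dict, outputs: dict) -> dict:
    """Record what data, which parameters, and which results were produced, and when."""
    data_blob = json.dumps(data_rows, sort_keys=True).encode()
    return {
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "data_fingerprint": hashlib.sha256(data_blob).hexdigest(),
        "model_params": model_params,
        "outputs": outputs,
    }

# Hypothetical example: a daily sentiment score derived from stored tweets.
snapshot = audit_snapshot(
    data_rows=[{"tweet_id": 1, "text": "great product"},
               {"tweet_id": 2, "text": "terrible service"}],
    model_params={"model": "sentiment-v1", "threshold": 0.5},
    outputs={"sentiment_score": 0.12},
)
print(json.dumps(snapshot, indent=2))

If both the raw data and such snapshots are retained, an auditor can later verify that the same inputs and parameters still yield the published scores; if the data is disposed of, the fingerprint alone cannot recreate the past.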


7.4 Who Owns the Data

Data is often referred to as the new oil or the new commodity. Unlike traditional commodities, such as oil or natural gas, data exhibits unique characteristics. For example, data is easy to replicate. Once created, it can last almost forever. Another important characteristic is that data can be stolen or hacked without leaving a trace. All these characteristics present new challenges on how data must be handled. Contrary to a home invasion or a material theft, where tangible traces can reveal a malicious act, the forensic evidence in case of a cyber theft is many times impossible to determine. Cyber security and data theft are like someone invading your home, stealing most of your valuable items, leaving no trace behind, and conducting the operation while you are home enjoying a nice time with family and friends. Cyber security and data forensics are among the fastest growing fields, with high demand for talent.

Content is king. This is true for any enterprise in the business of generating content, like giant media corporations. With data available in abundance, the question of who really owns data remains. Many examples in recent years, like the Snowden case, WikiLeaks, the Panama Papers, and Anonymous hackers, have raised serious legal questions on who owns the data. The case of Cambridge Analytica and Facebook made the headline news in 2018. Users' profiles and posts were basically shared with a consulting company for research purposes without users' knowledge and consent. The case ended in court, and Cambridge Analytica ceased operations the same year. Sharing private information is not a new phenomenon and dates back probably to the beginning of mankind (Fig. 7.6). We are all in a way accomplices in the crime of sharing private information. The simple act of sharing an image on social media can be an invasion of privacy. The notions of chatter, gossip, and water coolers have for a long time defined many decisions and shaped many strategies.

Like any other commodity, data is traded, exchanged, and sold by brokers, including what may be considered sensitive data. Without strong regulations in place on how data should be handled, and who owns the data, the playing field is

Fig. 7.6 Chatter and water cooler


Fig. 7.7 Selling data by the pound

wide open for brokers to operate almost freely. Buyers of data range from advertisers and marketers to other brokers and even governments. Most of it is probably harmless, anonymized consumer data. Many times, the data is collected without the direct knowledge of the consumer, embedded in the fine print when an app is downloaded or only vaguely stated. What is being collected and sold can cover a wide spectrum of information such as religion, ethnicity, income, and medical history. Also included are details such as alcoholism, depression, cancer, heart disease, and even sexual orientation. Such details can be deduced from purchases made, online sites visited, and bars or gyms frequented.

When purchasing data, the question is how to value the data. The simplest way is to set a price based on the "weight" of the data, as depicted in Fig. 7.7. By weight we simply mean the size of the data as measured in bytes. This is common when, for example, phone companies or Internet providers charge subscribers a monthly fee based on the number of bytes consumed in a month. It makes sense when what matters is the amount of data traffic over a network (landline or wireless) and not the data itself. The question is trickier when the interest is in the data itself. The price of the data now depends on the quality of the data and the added value to a business. So how do we measure this? Many factors come into play depending on the nature of the business, the quality of the data, and the breadth and depth of coverage. Let us assume we are interested in acquiring financial historical data for the period of 2005–2015. This could be intraday tick data or end-of-day data. The price will be different in each case. Let us say we are interested only in a smaller set of data from when the market crashed in September of 2008. The price of the data for 2008 will be different than, for example, for the year 2006. The more impactful the data is, the higher


we should expect the price to be. Also of relevance is the quality of the data. Is the data edited, corrected, patched, etc.? Is there missing or duplicate data? Such details will impact the price of the data and the value added to a business. Accounting for these details requires good data investigative work. Institutions, big or small, may not have the resources to carefully assess the quality of the data they are billed for. There are also no legal liability terms obligating the provider or seller to guarantee the quality of the data. In almost all cases data is provided as is, with a footnote to use it at your own risk. This protects the provider from any legal consequences due to the quality or misuse of the data.

In a more digitized environment of online interactions, data is constantly collected. Facebook, Instagram, TikTok, Twitter, and Uber all provide a platform for publishing and/or for communicating. Their biggest asset is not necessarily the platform, but the data collected. The collected data is subject to privacy rules, often written in legal terms hard for a lay person to interpret. The more users are active on the platform, the more data is generated. The moment users migrate away and stop using the platform, the business will fail and cease to exist. So how does the business liquidate its assets if the only asset it has is data? The laws governing business assets are different in case of a bankruptcy and a liquidation than when the business is operating normally. Consider the case when RadioShack filed for bankruptcy in 2015. Part of the assets liquidated were the millions of emails, names, and addresses of customers, as well as potential information on customers' shopping habits. Both Google and Facebook leave open the possibility of customer data being sold as part of an asset liquidation under such terms. In answering the question of who owns the data it is therefore important to understand the legal circumstances and possible ramifications.

The creator of a content is in a way its owner. Any monetization of the content, whether by selling data or filling the media channels, should have the permission of the owner. Any revenue generated from the content should in principle be shared with the owner of the content. It is not however always feasible to associate a content with a particular owner. What about association at the granular data level of bits and bytes? Such details are easily lost in the transmission of data. Unless there is a way to tag each transmitted bit or byte with a particular user, it is almost impossible to keep track and trace the data back to the original source. Attempts have been made in that regard, but not at a scale to be widely implemented and adopted.


7.5 The Coming of COVID-19

The COVID-19 pandemic, which started in early 2020, marked a major global health crisis of proportions not seen in recent history. Even when the pandemic is over, some of the effects it had on our lifestyles, work habits, and global interactions will be here to stay. Big data played a major role in tackling the spread of the virus and raised many questions on privacy and ethics. Monitoring, contact tracing, and quick intervention proved very effective in China, where the virus first emerged. With little concern for matters of privacy, the movements and activities of every citizen were traced, and a risk score (like a credit score) was calculated. This required active participation from every individual to install an app on their mobile phones. In a country where the mobile phone is widely used for daily activities, from making payments to checking on services and tracking locations, valuable data is constantly being collected. The risk score of every individual was updated in near real time. Accordingly, the person was allowed or denied access to services, events, work locations, and other privileges. Equipped with such information, it became easier to contain the spread of the virus. The idea is very similar to the social credit score China introduced prior to the pandemic; the infrastructure was already in place. In other, earlier examples, dating apps have been developed to alert a person when they come in proximity to a person sharing similar habits and interests based on their mobile phone profile. Many are willing to trade privacy in return for better services, particularly when related to health matters.

One implication of COVID-19 is the wider use of digital payments instead of cash. More and more individuals are opting to make payments with their mobile phones, and businesses are refusing to accept cash transactions. The adoption of digital payments using mobile phones is a convenience that comes at the expense of more data being collected. Unlike China, countries in the West have been more reluctant and slower to adopt similar measures without checks in place to prevent the misuse of the collected data for other purposes. The European GDPR (General Data Protection Regulation) protects the individual and warns against the use of collected data for profiling. But once data is created, it is almost impossible to guarantee there is no copy of the data somewhere else, unless there is a way to tag the data and trace its journey. Data privacy then becomes a mere legal shield to protect the individual, making it inadmissible to use the information in courts of law. It is effective when everyone, or the majority, is abiding by the law. The risk is always there in case of ill intentions or when the data falls into the wrong hands.

8: Role of Academia, Industry, and Research

8.1 Revamping Academia

Academic institutions have long been compared to ivory towers where change comes slowly. To some degree this is true, but we need to differentiate between three major roles for academia: providing quality education, building skills relevant to industry, and conducting fundamental research work. How do we also measure a school's success? Is it by the number of published papers in highly rated journals? Is it by student population and the number of graduates? Is it by revenue generated? Is it by the number of recent graduates with job offers? Is it by the size of a supportive and successful alumni community? Unfortunately, very few statistics are available to track students' success after graduation. The answer is probably a combination of all of the above.

The COVID-19 pandemic forced many schools to go online. The move exacerbated an existing problem and brought it to the surface. It became a wake-up call for many parents and students to ask the question: What is my return on investment (ROI)? With so many other online options available for less money (sometimes free) and of high quality, it is harder to justify the school tuition for an online class. Are schools mostly about the campus experience, sports, and living outside the home? What about the content and preparedness for real-life experience? The many schools' loss of revenue due to the pandemic was a problem waiting to happen. The pandemic was simply the catalyst, not the root cause.

Since going online, some schools have been investing in ways to deter or prevent students from cheating, such as proctoring and requiring cameras to be turned on to monitor students. It has become a cat and mouse game, with some students finding ways to deceive the technology, further widening the digital divide between those who can and those who cannot. It is a misplaced focus when schools engage in the business of policing and monitoring students. The focus should instead be on providing high-quality and competitive content to justify the costs. Schools need to adapt and revamp their offerings to coexist with Google, Alexa, smart watches, and the Internet. They should train students in problem solving,


critical thinking, creativity, seeing the big picture, and leveraging the vast number of freely available resources to learn how to solve problems. The age of memorizing, reciting, and paying a high premium for knowledge that can be found on Google is over. Students should learn how to work and collaborate with AI and robots. In a digital age where we are constantly bombarded with data and information, what matters most is not what knowledge we have, but what we can do with that knowledge. This requires new thinking and a new approach to teaching. Access to Google or the Internet during a test should be the norm.

Companies like Google have developed their own new programs and certificates to help participants acquire skills much needed in industry. The courses take a short time to complete (a few months) and cost a fraction of a traditional college education. Certificates from Google, Amazon, and the like can become a preferred alternative to certificates offered by universities. The advent of MOOCs and the many online courses offered at low cost, some taught by reputable professors, is already changing the landscape and forcing many schools to rethink their business value proposition. Top tier schools with name credibility will have the advantage over lower tier schools that lack the branding. In a competitive market, and an industry more interested in applied skills, many schools will have a harder time justifying their continued existence. It is time for a serious change.

8.2 Bridging the Gap

For years national research laboratories in the USA and elsewhere have played, and continue to play, a pivotal role in the advancement of technology and science that has improved the lives of millions. Among the many national labs in the USA are the NASA Jet Propulsion Laboratory (JPL) in California, Sandia in Albuquerque, MIT Lincoln Laboratory in Lexington, Lawrence Livermore in Livermore, and the Fermi and Argonne labs near Chicago. With funding from US government agencies such as the Department of Defense (DoD) and the Department of Energy (DoE), these labs are less concerned with short-term profits and more focused on long-term scientific research work. Most operate as part of a collaborative network of scientists and researchers with other national labs globally and with academic institutions. Many discoveries at national labs eventually find their way into industry and are commercialized for wider adoption. The Internet, for example, now almost a household utility, was initially a project funded by DARPA (a DoD agency) to connect US national labs. The first version of the World Wide Web (WWW) was developed by scientists at CERN.


Fig. 8.1 The disconnect

Fig. 8.2 Towards an ecosystem

Technology transfer from open, government-funded national research labs has for years, particularly in the USA, enabled advancements in science and technology benefiting industry and society at large. Contrary to national labs and academia, corporations and startups are driven by short-term goals where the bottom line is to maximize shareholders' return on investment. This disconnect in goals and methods of work is not in the best interest of either end of the spectrum (Fig. 8.1). A close interaction between academia, research labs, industry, and entrepreneurial startups creates an ecosystem that is a win-win for all (Fig. 8.2). The push towards such an ecosystem has been accelerating with industry participation. Private corporations and startups are investing money in establishing in-house research labs, allowing for the same creative and entrepreneurial exploration of new ideas.


National labs and academia are tapping into specific talents found in industry for business management and applied skills. Prior to its merger with Chase, JP Morgan formed JP Morgan Labs, where different entrepreneurial ideas benefiting the financial sector were explored. The labs operated semi-autonomously, the idea being that even if only one in ten projects proved successful, it was enough to justify the costs and be rewarding in the long term. Some of the advanced analytics currently in use by major financial institutions came out of spin-offs from JP Morgan Labs. Research work conducted at major universities is often supported by grants from private and non-private sectors, ranging from government agencies to industry corporations. The different and sometimes conflicting needs of students, academia, and industry raise challenges many universities continue to struggle with (Fig. 8.3). Top tier universities are best equipped to meet those challenges. An evolving model for schools is one where existing skills are augmented with the necessary learning credentials and training to best match the needs of the business or industry, while also allowing for fundamental research work (Fig. 8.4).

In industry it is common to find people with good programming skills who lack the business knowledge, and vice versa. For a business major, coding is more of a black box and a tool to extract the intelligent information or actionable insights from the data to drive decision making. For a computer science major, coding is the goal; it is not about the business context and the actionable insights. The same applies to other fields, like quants and data scientists. Without a good understanding of the business framework, no matter how good the tools are and how advanced the analytics are, we will end up with a fancy solution for the wrong problem. The disconnect between the business understanding and the analytics can be very costly to a company. This realization has led to the development of new interdisciplinary programs combining, for example, coding skills with business knowledge or vice versa. Such interdisciplinary programs are sometimes referred to as CS+X, with CS standing for Computer Science and X a field in business, anthropology, biology, or art. Those interested in advancing their programming skills are best served at a liberal arts school with a strong CS department. Those interested in acquiring domain expertise are best served in a business school with a strong department in Finance or Marketing. Advanced analytical skills are best acquired in a school of Engineering, Mathematics, or Statistics. The fields of data science, data analytics, and business analytics are very interdisciplinary in nature. Schools successful in incorporating interdisciplinary programs provide the best of what each department has to offer, and best serve the needs of students and of industry.

Fig. 8.3 The most wanted


Fig. 8.4 Integrated individualized solution

8.3 STEAM for All

Science, Technology, Engineering, and Mathematics (STEM) without the Arts (A) is incomplete. Promoting and cultivating critical and analytical thinking is not about being STEM oriented or non-STEM. Analytical thinking is not about being a rocket scientist, mathematician, statistician, or computer programmer. It is not unusual to encounter someone with good computer programming skills, or even advanced analytics, who lacks the analytical mind. In contrast, one can come across someone with a business or non-STEM background such as art, psychology, history, or literature with great analytical thinking and problem-solving skills. So how do we define the analytical mind? It is less about specialty in the tools and more


Fig. 8.5 Curious and investigative mind

about what to do with the tools to solve problems. Per many published reports, the most valuable and in-demand skills in industry are (Fig. 8.5):

– A curious, investigative mind
– Problem-solving skills and an affinity for data and the unknown
– Enjoyment of playing with puzzles and exploring new ideas
– Good intuition, critical thinking, and an innovative and creative mind
– The ability to see the big picture and to connect the dots

Science, technology, engineering, and math are not complete without the arts. Arts allow us to see things and connect the dots at a level that cannot be expressed by a formula or a computer program.

Most of these skills are not easily learned, and not easily programmable for AI; that is why they are in such demand. When solving a problem, there are four realizations we need to be mindful of. They are called the four quadrants of the "matrix knowledge" (Fig. 8.6).

1. The "what you know that you know" (known knowns) is a relatively happy scenario. We simply need to leverage our knowledge base properly.
2. The "what you know that you don't know" (known unknowns) is a realization that itself helps narrow the scope of a problem. We need to use proper tools to investigate and resolve.
3. The "what you don't know that you know" (unknown knowns) makes it a bit harder to develop a solution. We need to learn, explore, and develop tools to identify these.
4. The "what we don't know that we don't know" (unknown unknowns) leaves little we can do; with some help we can hope to identify those.


Fig. 8.6 The four quadrants matrix knowledge

Most problem solving and investigative work involves analyses that are qualitative and quantitative in nature. Qualitative analysis aims to gather an in-depth understanding of the underlying reasons and motivations for an event or an observation. It is more visual and graph oriented, and useful for exploratory research, the earliest stage of analytics. It is more about the shape of the curve, the general trend, rather than the actual values. Given a large amount of data, a qualitative visual inspection is always recommended to gain quick insights. Quantitative analysis refers to the investigation of phenomena via statistical, mathematical, and computational techniques. It aims to quantify an event with metrics and numbers, and it is more explanatory in nature. Any analysis should aim to reconcile the findings from both the qualitative and the quantitative methods. Any discrepancy is a sign of a conflict and should be investigated. Causes, if not rectifiable, should be attributed to clearly explainable reasons.
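As a small, hypothetical illustration of pairing the two types of analysis, the sketch below computes summary statistics and a correlation (quantitative) alongside a quick scatter plot (qualitative) for the same invented data set; the column names are placeholders.

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical daily marketing data.
df = pd.DataFrame({
    "ad_spend": [10, 20, 30, 40, 50, 60],
    "sales":    [12, 25, 31, 45, 48, 70],
})

# Quantitative: summary statistics and a correlation metric.
print(df.describe())
print("correlation:", round(df["ad_spend"].corr(df["sales"]), 2))

# Qualitative: a quick visual check of the shape of the relationship.
df.plot.scatter(x="ad_spend", y="sales", title="Sales vs. ad spend")
plt.show()

# The two views should reconcile: a high correlation should appear as a clear
# upward trend in the scatter plot, and vice versa; a mismatch warrants investigation.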

8.4 A Capstone Template

The final section of this book offers a template for a capstone project, or a practicum. It is designed more for the business-oriented reader, based on topics discussed in this book, and is well suited for industry applications. The capstone project follows the 3 × 3 stages presented earlier in Sect. 1.12 on framing, solving, and reporting. The scope is to demonstrate the ability to tackle a problem involving large data sets from inception to delivery, to leverage available tools, to reconcile qualitative and quantitative analysis, and to infer the actionable insights that drive a decision. The capstone project report is organized in three major parts.


1. Executive Summary
A half-page summary of the project outlining the rationale, benefits, and key findings. Although it is the first part of the report, the executive summary is written when all work is completed. This is the most read part of a report, and often the only part. It is comparable to a five-minute elevator pitch where key takeaways are presented to obtain the buy-in from stakeholders. It should be concise and compelling.

2. Business Requirement
This is where the problem is framed: a description of the problem with clear identification of the issues to be addressed, how the data will help, and the expected added value to the business that can drive decision making. Most importantly, this is where the targeted audience and stakeholders are identified. This is where the data story is prepared. In a bigger real-world context such details are part of a more elaborate business requirement document (BRD). A BRD includes additional details on project timeline, critical dependencies, milestones, deliverables, and resourcing. Some questions and activities this section attempts to answer are:
a. Summary of previous work already published
b. What is new in the proposed work
c. Targeted market sector
d. Technology, i.e., tools, database, software
e. How the data and the analysis can help gain a competitive edge
f. Potential for monetization and revenue generation

2.1. Data Sizing and Description
This section is about sizing the data, defining data access methods, understanding the terms and conditions for licensing, and how the data is sourced and collected (a minimal sizing sketch follows this list). Some questions and activities this section attempts to answer are:
a. Licensing terms and conditions
b. Format of the data file, e.g., CSV, XLS, HTML, JSON
c. Access or delivery method, e.g., API, FTP, download
d. Type of data: text, numeric, categorical, structured, unstructured, etc.
e. Nature of data: historical, recent daily data, time range, proprietary, public, etc.
f. Size of the data file in bytes, number of rows and columns
g. Description of all columns, with clear separation of the optional nice-to-have data fields and the must-have fields
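A minimal sketch of the kind of sizing pass this section calls for is shown below; the file name and columns are placeholders for whatever data set the capstone actually uses.

import os
import pandas as pd

PATH = "capstone_data.csv"  # placeholder file name

df = pd.read_csv(PATH)

print("file size (bytes):", os.path.getsize(PATH))
print("rows, columns:", df.shape)
print("memory footprint (bytes):", df.memory_usage(deep=True).sum())

# Per-column description: data type and share of missing values,
# a first input for separating must-have from nice-to-have fields.
summary = pd.DataFrame({
    "dtype": df.dtypes.astype(str),
    "missing_pct": (df.isna().mean() * 100).round(1),
})
print(summary)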


2.2. Data Understanding, Analysis, and Preparation
This is about understanding and describing the data entity relationships and the required fields, identifying the data cleanup steps, and defining the business logic rules for data quality and integrity. It is about demonstrating data domain expertise and building confidence in the data before any modeling. This is where most of the data groundwork is done; it tends to be the most laborious and time-consuming part. Some questions and activities this section tries to address are:
a. Explain the business nature of the data: how the data was generated and collected, the units, the relations, etc.
b. Identify the steps to pre-process the data, such as unit conversion, cleanup of undesired characters or symbols, calculation of new data, etc.
c. Assess the quality of the data, such as missing entries, duplicates, and outliers, and ways to remedy them.
d. Derive an entity relationship diagram (ERD) and relational schemas suitable for a database implementation.
e. Define business logic rules to check for data field integrity violations.
f. Define business logic rules to check for data field relationship integrity violations.
g. Conduct a preliminary descriptive analysis of the data, both quantitative (statistical description of the data) and qualitative (charts, scatter plots).
h. Identify interesting patterns in the data, and potential actionable insights.

3. Modeling and Solving
This is where the model selection and the assessment of model goodness are discussed. It is assumed that all the groundwork with the data is completed and the data is in its best possible shape.

3.1. Model Selection
In this section, a model (e.g., regression, classifier) is developed using a training data set (part of supervised learning). Simple/multilinear and non-linear regression models (e.g., quadratic) must be considered. The analysis should include a description of the methodology, a distinction of dependent and independent variables, an assessment of model goodness both qualitatively and quantitatively, a summary of results, and justifications for the final model selection (a minimal sketch follows this section). The qualitative analysis considers the visual characterization of the model goodness, while the quantitative analysis looks at measurable quantities to determine the goodness of the model. It should be possible to push the envelope and explore other solutions believed to help with the overall decision-making process.
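As a hedged illustration of the model selection step, and not a prescribed method, the sketch below fits a simple linear and a quadratic regression to invented data, compares them quantitatively (R² on held-out test data), and adds a qualitative plot for visual reconciliation.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.metrics import r2_score

# Invented data: one independent variable x, one dependent variable y.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 200).reshape(-1, 1)
y = 2.0 + 1.5 * x.ravel() + 0.3 * x.ravel() ** 2 + rng.normal(0, 2, 200)

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)

candidates = {
    "linear":    make_pipeline(PolynomialFeatures(degree=1), LinearRegression()),
    "quadratic": make_pipeline(PolynomialFeatures(degree=2), LinearRegression()),
}

for name, model in candidates.items():
    model.fit(x_train, y_train)                   # fit on the training set
    r2 = r2_score(y_test, model.predict(x_test))  # quantitative goodness on test data
    print(f"{name}: R^2 on test data = {r2:.3f}")

# Qualitative check: plot the test points against the quadratic fit.
order = np.argsort(x_test.ravel())
x_sorted = x_test.ravel()[order]
plt.scatter(x_test.ravel(), y_test, label="test data")
plt.plot(x_sorted, candidates["quadratic"].predict(x_test)[order], color="black", label="quadratic fit")
plt.legend()
plt.show()

The quantitative scores and the visual fit should tell the same story; if they do not, the discrepancy itself becomes a finding to investigate before selecting the final model.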


3.2. Demonstrating Value and Accuracy
In this section, we assess the correctness of the model by applying it to a testing data set. The assessment must compare results from both the qualitative and the quantitative analysis and demonstrate reconciliation. Any discrepancy should be explained and attributed to a justifiable cause. Finally, the analysis should demonstrate the value of the model in helping make better decisions, and identify areas of deficiency and/or improvement.

4. Reporting and Communicating Results
This is the final section summarizing the results and the actionable insights. Key outcomes should be included here, and some in the executive summary. We want to be articulate, concise, and clear in making a compelling case and getting the buy-in from the readers, or more precisely the stakeholders. In this section, we also rely more on visual analytics and assume the audience or reader has little or no background in the topic. We want to avoid complex analytics and explain the complex in simple forms while avoiding over-simplification. This is where we connect the dots and build the big picture to create a convincing data story. Here we also discuss potential future work and possible improvements to enhance the model and draw more insights. This is the final data story!