English Pages 77 [79] Year 2020
Why AI/Data Science Projects Fail How to Avoid Project Pitfalls
Synthesis Lectures on Computation and Analytics

This series focuses on advancing education and research at the interface of qualitative analysis and quantitative sciences. Current challenges and new opportunities are explored with an emphasis on the integration and application of mathematics and engineering to create computational models for understanding and solving real-world complex problems. Applied mathematical, statistical, and computational techniques are utilized to understand the actions and interactions of computational and analytical sciences. Various perspectives on research problems in data science, engineering, information science, operations research, and computational science, engineering, and mathematics are presented. The techniques and perspectives are designed for all those who need to improve or expand their use of analytics across a variety of disciplines and applications.

Why AI/Data Science Projects Fail: How to Avoid Project Pitfalls
Joyce Weiner
Copyright © 2021 by Morgan & Claypool All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means—electronic, mechanical, photocopy, recording, or any other except for brief quotations in printed reviews, without the prior permission of the publisher. Why AI/Data Science Projects Fail: How to Avoid Project Pitfalls Joyce Weiner www.morganclaypool.com
ISBN: 9781636390383 print ISBN: 9781636390390 ebook ISBN: 9781636390406 hardcover DOI 10.2200/S01070ED1V01Y202012CAN001 A Publication in the Morgan & Claypool Publishers series SYNTHESIS LECTURES ON COMPUTATION AND ANALYTICS Lecture #1
Why AI/Data Science Projects Fail How to Avoid Project Pitfalls Joyce Weiner Intel
SYNTHESIS LECTURES ON COMPUTATION AND ANALYTICS #1
MORGAN & CLAYPOOL PUBLISHERS
ABSTRACT
Recent data shows that 87% of Artificial Intelligence/Big Data projects don’t make it into production (VB Staff, 2019), meaning that most projects are never deployed. This book addresses five common pitfalls that prevent projects from reaching deployment and provides tools and methods to avoid those pitfalls. Along the way, stories from actual experience in building and deploying data science projects are shared to illustrate the methods and tools. While the book is primarily for data science practitioners, information for managers of data science practitioners is included in the Tips for Managers sections.
KEYWORDS
data science, project management, AI projects, data science projects, project planning, agile applied to data science, Lean Six Sigma
Contents

Preface
1 Introduction and Background
2 Project Phases and Common Project Pitfalls
  2.1 Tips for Managers
3 Five Methods to Avoid Common Pitfalls
  3.1 Ask Questions
  3.2 Get Alignment
  3.3 Keep It Simple
  3.4 Leverage Explainability
  3.5 Have the Conversation
  3.6 Tips for Managers
4 Define Phase
  4.1 Project Charter
  4.2 Supplier-Input-Process-Output-Customer (SIPOC) Analysis
  4.3 Tips for Managers
5 Making the Business Case: Assigning Value to Your Project
  5.1 Data Analysis Projects
  5.2 Automation Projects
  5.3 Improving Business Processes
  5.4 Data Mining Projects
  5.5 Improved Data Science
  5.6 Metrics to Dollar Conversion
6 Acquisition and Exploration of Data Phase
  6.1 Acquiring Data
  6.2 Developing Data Collection Systems
  6.3 Data Exploration
  6.4 What Does the Customer Want to Know?
  6.5 Preparing for a Report or Model
  6.6 Tips for Managers
7 Model-Building Phase
  7.1 Keep It Simple
  7.2 Repeatability
  7.3 Leverage Explainability
  7.4 Tips for Managers
8 Interpret and Communicate Phase
  8.1 Know Your Audience
  8.2 Reports
  8.3 Presentations
  8.4 Models
  8.5 Tips for Managers
9 Deployment Phase
  9.1 Plan for Deployment from the Start
  9.2 Documentation
  9.3 Maintenance
  9.4 Tips for Managers
10 Summary of the Five Methods to Avoid Common Pitfalls
  10.1 Ask Questions
  10.2 Get Alignment
  10.3 Keep It Simple
  10.4 Leverage Explainability
  10.5 Have the Conversation
References
Author Biography
Figures

Figure 4.1: Example project charter
Figure 4.2: Supplier-input-process-output-customer (SIPOC) analysis table. The SIPOC is completed in three parts following the numbered steps
Figure 4.3: Example SIPOC for engineering dispositioning material
Figure 4.4: Example SIPOC with both Part 1, Process, and Part 2, Output and Customer completed
Figure 4.5: Example SIPOC with Part 3 Suppliers, Inputs started
Figure 4.6: Example SIPOC with all parts completed
Figure 8.1: Example presentation slide
Tables

Table 1.1: Five project pitfalls
Table 1.2: Alignment between data science project phases and Lean Six Sigma DMAIC framework
Table 3.1: Connection between the methods to avoid pitfalls and the five project pitfalls
Table 3.2: Questions to ask at retrospectives
Table 4.1: Key components of a project charter
Table 5.1: Deliverables and metrics for various types of data science projects
Table 5.2: Example calculation for time saved
Table 5.3: Types of waste with manufacturing and office examples
Table 5.4: Common metrics and dollar conversion
Table 8.1: Data science project types and typical final deliverables
Table 8.2: Data visualization reading list
Preface

Who is this book for? To answer that question, I need to give a little background. My degrees are in physics. I have a physics undergraduate degree and a Master’s degree in optical science. In physics there are two disciplines: theoretical physics and experimental physics. Similarly, I have observed that in data science, there are theorists who focus on developing algorithms, and practitioners who use algorithms and apply data science. This book is primarily for data science practitioners. I’ve also included information for managers of data science practitioners in the Tips for Managers sections.

This book is organized into chapters. The first two chapters introduce common project pitfalls and the methods to avoid them. The next chapters are based on project phases for data science projects. In each of the project phase chapters, I’ll include tools you can use to help avoid the project pitfalls, highlighting which of the five methods the tool supports. To get you started, the five methods are: (1) ask questions; (2) get alignment; (3) keep it simple; (4) leverage explainability; and (5) have the conversation.

Throughout this book I use the term “data science projects” as an all-encompassing term that includes Artificial Intelligence (AI) and Big Data projects. AI itself is an inclusive term. Machine learning and deep learning are both types of AI. AI is defined by the Oxford English Dictionary as the theory and development of computer systems able to perform tasks that normally require human intelligence, such as visual perception, speech recognition, decision-making, and translation between languages (Oxford Languages, 2020). That means it encompasses any method of computation that mimics human intelligence: not only what we often think of (machine learning and neural nets), but also expert systems and optimization algorithms.

I’m using “customer” to mean the person getting value from the project and the end user of the project. I’m using “management” to mean your leadership and decision-making chain within your organization.
CHAPTER 1
Introduction and Background

At the INFORMS Business Analytics Conference in 2018, one of the speakers shared a statistic that 85% of Artificial Intelligence (AI) and Big Data projects failed. Looking into it, I learned that in this case, failure was defined as “not being deployed.” So, starting a project that didn’t make it to production was the definition of the project failing. I was surprised by the number. 85% is a large percentage. I could think of a few reasons why a project wouldn’t get deployed. Maybe you didn’t like the results from the models you tried, or you needed to collect some more data. But those reasons wouldn’t account for such a large percentage. That 85% of AI and big data projects would fail indicates a systematic problem, not something specific to a particular project.

In following up after the conference, I found an article by VentureBeat (VB Staff, 2019) that increased the failure percentage to 87%. Now I was very interested and started to think about why. Why is it that 87% of AI/big data projects would fail before they reached deployment? What were the systematic problems that would cause such a high rate of failure?

The VentureBeat article reports on a panel session at Transform 2019 with Deborah Leff, CTO for data science and AI at IBM, and Chris Chapo, SVP of data and analytics at Gap. During the panel session, Leff and Chapo discussed reasons why only 13% of data science projects make it into production. In the session, they mentioned the need for leadership support, access to data, collaboration across teams, long-term ownership of solutions, and keeping it simple. The reasons the panelists shared resonated with me, but I wasn’t fully satisfied because some of these problems are not within the control of the data scientists or others doing the projects. I started thinking about systematic challenges that can prevent a data science project from reaching deployment that could be controlled by the people doing the work.
I’ve been a data scientist for 25 years. That is not to say that my job title has been “Data Scientist” for that amount of time. When I started, no one talked about Data Science. I needed to use an entire paragraph to explain my area of technical expertise. This paragraph covered data extraction, analysis, visualization, and modeling. Nowadays, I can just say I’m a data scientist and people understand what I do.
Over my career, I’ve worked on all sizes of data science projects. I’ve built small reports, dashboards, and large predictive models. Any project carries risks, or pitfalls, that can cause it to fail and that are within the control of the people working on it. The five pitfalls are:

1. the scope of the project is too big;
2. the project scope increased in size as the project progressed (i.e., scope creep);
3. the model couldn’t be explained, hence there was lack of trust in the solution;
4. the model was too complex and therefore difficult to maintain; and
5. the project solved the wrong problem (Table 1.1).

These are all systematic problems and can be addressed with a good framework.

Table 1.1: Five project pitfalls

| # | Pitfall |
|---|---------|
| 1 | The scope of the project is too big |
| 2 | The project scope increased in size as the project progressed (scope creep) |
| 3 | The model couldn’t be explained |
| 4 | The model was too complex |
| 5 | The project solved the wrong problem |
An area of interest for me throughout my career has been to use data to drive efficiency improvements. I began my career working as a process engineer in manufacturing. While working in manufacturing, I became interested in process improvement and earned my Lean Six Sigma black belt.1 Lean Six Sigma uses the DMAIC framework of Define, Measure, Analyze, Improve, and Control as a strategy for improving processes. During a talk I attended on data science project phases, I realized that the phases for data science projects (1. Define project objectives; 2. Acquire and explore data; 3. Model data; 4. Interpret and communicate; and 5. Implement, document, and maintain) lined up with the DMAIC framework (Table 1.2). This made me think that some of the tools from Lean Six Sigma would be very helpful in overcoming the five pitfalls.
1 If you are interested in learning more about Lean Six Sigma, an excellent overview is What is Lean Six Sigma? (George, Rowlands, and Kastle, 2003).
Table 1.2: Alignment between data science project phases and Lean Six Sigma DMAIC framework

| Data Science Project Phases | Lean Six Sigma DMAIC Framework |
|-----------------------------|--------------------------------|
| Define project objectives | Define |
| Acquire and explore data | Measure |
| Model data | Analyze |
| Interpret and communicate | Improve |
| Implement, document, and maintain | Control |
Getting back to the Transform 2019 panel session, the reasons the panelists gave for projects failing were: the need for leadership support, access to data, collaboration across teams, long-term ownership of solutions, and keeping it simple. Of these reasons, leadership support, access to data, and collaboration across teams are related to communication and to having management alignment. Keeping it simple is just that: simple. Sometimes not easy, but simple. Long-term ownership of a solution is a matter of planning, both planning for deployment and building for maintainability.

There is one other potential problem that the panelists didn’t mention: lack of understanding of, and therefore discomfort with, a model. If you can’t explain why a model is predicting a particular outcome, and having that explanation is important to management or the customer for the model, then your project will fail.

Fundamentally, treat a data science project the way you would any other project. A data science project is not a quick and dirty one-off skunkworks kind of thing, at least not if you want it to be deployed. To get a project to production you need to plan it and do up-front work. If you go slowly at the start of a project, and do the work of defining the problem and getting alignment with the end customer for the project, then you will only solve the problem once. By going slower up front you can end up going faster in the end.

So, what exactly can you do to avoid project pitfalls? There are five methods altogether.

1. Ask questions.
2. Get alignment.
3. Keep it simple.
4. Leverage explainability.
5. Have the conversation.
Asking questions enables communications and helps start the process of getting management alignment. Asking questions up front ensures you are positioned to start a project that will deliver results that the customer wants. Asking questions fosters a collaborative atmosphere which will help if you need to get assistance from other teams. Explaining your intentions and documenting the project helps with alignment. Having metrics supports accountability. You can use them to request help and resources and get support from management. Aligning with management and getting support from the start of the project will ensure that management is aware of your project and can help if needed later. Keeping it simple prevents a project from becoming so big that it can never be finished. It also prevents problems with maintaining a project. If it is simple to explain, and simple to execute, it can be simple to transfer to a different owner who can sustain it long term. Starting with a simple problem allows you to build a solution and then decide if adding on is needed or desired. This is the crawl, walk, run methodology. Leveraging explainability is tied to both asking questions and getting alignment. If your management isn’t that comfortable with AI, they may prefer models with very clear connections between the inputs and outputs. Knowing this in advance will allow you to select the type of model that works for the project and meets their criteria. In some cases, they may not care at all. This is important to know in advance and asking questions and getting alignment helps to ensure you will build something that your management and customer is happy with. Lastly, all of these are about having the conversation. As a project goes along there are many decision points. If you know what the items of key importance are for your customer in advance, and have alignment with your management, you can make these decisions quickly and move the project forward. 
Even with up-front engagement and alignment you will need to continue to have conversations with management and your customer as the project progresses. The best way to handle this is to understand that this need will occur and plan for it by having regular check-ins with both your customer and management.
CHAPTER 2
Project Phases and Common Project Pitfalls

Let’s take a more in-depth look into the reasons projects don’t get to production. While there are many reasons, and clearly this must be true because 87% of projects don’t make it, I’ll cover five reasons that are systematic in nature. These reasons are:

1. the scope of the project is too big;
2. the project’s scope increased in size as the project progressed (i.e., scope creep);
3. the model couldn’t be explained, hence there was lack of trust in the solution;
4. the model was too complex; and
5. the project solved the wrong problem.

We’ll explore each one and discuss them in detail.

The first reason for a project not to get to deployment to production is that the scope is too big. If we can’t fully get our arms around a project, it becomes difficult to plan and is hard to finish. If we are overly ambitious and get ahead of our competency or beyond the ability of the team working on a project, we can be slow to finish or not finish at all. This is analogous to the do-it-yourself TV shows which rescue a couple with no building expertise after they have gotten in over their heads. Sometimes the couple jumps into demolition without planning the project and runs into a problem. Sometimes they pull down a load-bearing wall because they didn’t do the work ahead of time to investigate. Then, they must pay for experts to come in, assess the situation, and correct the problems. This leads to increased time and expense to complete the project.

The same thing can happen with data science projects. We can bite off more than we can chew. Typically, we don’t pull the roof down on our heads. We just fail to deliver the project to production. The problem with having too big a scope is that it is difficult to detect unless you are checking in with your customer on a regular basis. Having too big a scope is a common pitfall because everyone wants to deliver good work, and people are generally
ambitious. “Sure,” we say, “we can have it do this and that, and that other feature, and automatically control everything.” The problem then becomes if you go off to build that all as one piece. If you do that, it puts you at risk of not finishing within the customer’s desired timeframe because it has become a really huge project. It also puts you at risk of another pitfall, solving the wrong problem. The solution to a big project is breaking it into smaller deliverable pieces that can be put into production as you go—incremental development. This gives you a chance to check in with the customer. A benefit of an incremental development framework like in Agile development2 is that this check is built into the process. Avoid waiting until the very end to show your project to the customer and get feedback. If you build incrementally and share as you go, you can learn if they will be satisfied by a smaller scope. I’ve only had two cases where I’ve ever scoped a project too small, and that was easily corrected by increasing the scope based on what we had learned so far. So, having too big a scope can be addressed by asking questions at the start of the project, and by delivering incrementally and asking questions at each delivery. As in the case of the do-it-yourself couple, ask the equivalent questions to “What’s in this wall?”: Where are the boundaries of the project? Does it feel big? Can it be broken into pieces that can be delivered separately? To give you an example of breaking a project into pieces, let’s look at the case of taking a manual process and adding AI. To do this, you need data and a data pipeline. If those are not yet available, that needs to be the first part of the project. You need to automate the process and establish data collection. Then, later, you can add AI. Trying to do it all at once just sets you up for failure. It’s also clear from this example why having alignment is necessary. 
Without that conversation there would likely be a misalignment in expectations for the project deliverables. While you would know that you need to first set up automation so data for a future AI model can be collected, management might only see that this project is taking forever and may pull the plug. By having the conversation with management and getting alignment on the project, you not only prevent the project from being canceled, but also get recognition for the value delivered from automating a manual process as part of the project deliverables.

2 http://www.agilemanifesto.org/

The second reason for a project to fail is scope creep. In this case, you start the project off fine and then decide to add some more features. Additional features lead to complexity. Complexity can lead to bugs in code. Complexity can also cause problems with deployment, which can cause a project to fail. Scope creep can cause you to not finish a project because there is no defined “end.” If you decide in advance what the first-pass accuracy criteria will be for the
model, you won’t fall into the trap of polishing it endlessly. The same applies to features of the overall project. The thing to keep in mind to fight scope creep is that you can always go back and make improvements later. Having a working model in deployment is better than not having anything. As my Lean sensei has quoted, “don’t let perfect be the enemy of good.” In other words, don’t hold off putting a working model into deployment because it isn’t perfect. To fight scope creep, it’s important to define the project’s objectives and scope up front. You can always go back and revise these as you learn, but you then have guidelines to keep you to a smaller project size that you can deliver to production. Say, for example, I am going to be building a report. To help get alignment with the customer, I’ll have a meeting to understand the decision they want to make based on the report and what their needs are. Then, before I start pulling data or writing code, I’ll draw out a mockup of the report and share it to get their feedback. I do this drawing on paper or on a whiteboard so we can iterate and make changes real time in the meeting. Once what the report should look like has been established, I have a good feel for the scope of the project. If the customer calls me and asks for a new feature before I’ve deployed the first version, I’ll suggest we wait for the first version to go out, and then see if the feature is needed. Depending on what they are asking, we can have a conversation about what currently planned feature to swap out for that new feature. One way to establish a boundary on scope is to have a fixed timeline for delivering the project. In Agile development we learn about the iron triangle of project management: resources, time, and features. Two out of the three are fixed, the last one is flexible. Traditionally, resources and features are considered “fixed” and then you have problems with slipping timelines and projects that never end. 
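A fixed timeline with flexible features can be pictured as filling a time box in priority order. The sketch below is only an illustration under stated assumptions: the `Feature` class, the `plan_sprint` function, the feature names, and the effort estimates are all hypothetical, and the greedy rule (take each feature in priority order if it still fits) is one simple choice among many.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    name: str
    priority: int       # 1 = most important to the customer (hypothetical ranking)
    effort_days: float  # estimated effort (hypothetical planning input)


def plan_sprint(features, capacity_days):
    """Fill a fixed time box in priority order; whatever does not fit is deferred."""
    planned, deferred = [], []
    remaining = capacity_days
    for feature in sorted(features, key=lambda f: f.priority):
        if feature.effort_days <= remaining:
            planned.append(feature.name)
            remaining -= feature.effort_days
        else:
            deferred.append(feature.name)
    return planned, deferred


# Hypothetical backlog for a report project with a 10-day time box.
backlog = [
    Feature("weekly report", priority=1, effort_days=4),
    Feature("drill-down view", priority=2, effort_days=5),
    Feature("email alerts", priority=3, effort_days=3),
    Feature("spreadsheet export", priority=4, effort_days=1),
]

planned, deferred = plan_sprint(backlog, capacity_days=10)
print("planned:", planned)
print("deferred:", deferred)
```

Here the time box never grows. A new request either waits in the deferred list for a later release or is swapped in for a planned feature of similar size.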
In Agile, resources and time are fixed; the features that go into a project are flexible. If you set a fixed timeline for the project, e.g., a two-week sprint with a product release at the end of each sprint, and adjust the features you deliver to fit inside that timeline, you can prevent scope creep. You also have to keep to the rule that you can’t add any features unless you have completed all the features you initially agreed to and still have time left, or you have learned something about what the customer wants and will swap one feature for another.

The third reason for a project not to get to deployment is that the model couldn’t be explained, so management (typically) has concerns about how it works and lacks comfort in the solution. To avoid this pitfall, keep it simple. The simpler a model is, the easier it is to explain. While it might be exciting for you, as the data scientist, to use the latest model that you were recently reading about, it may not be the best solution for the project. It can also backfire if that latest model
doesn’t have the explainability that your customer desires. To avoid this pitfall, have the conversation in advance with your customer. Do they need to know the reasoning behind and input values that influenced a given prediction, or are they ok with predictive values? Will they want to know causality or just correlation? In my work, I prioritize using machine learning over neural nets, and physical models over machine learning, because physical models are the easiest to explain, followed by machine learning algorithms, while neural nets can be a black box. If there is a known equation for the process you are working on, there is no need to get fancy—just use that physical model. When I was working in semiconductor manufacturing, we used physical models to predict critical dimensions. We created a system that compared the calculated measurement based on the input parameters to actual measurements and made a feedback loop to automatically adjust the lithography machine. This is an example of AI using a simple physical model. Keeping with models, the fourth risk for your project not to get to deployment is that the model is too complex. Again, the solution to this risk is to keep it simple. The simpler the model, the faster the calculation will be for inference. The simpler the model, the faster it will be to train, or if it is really simple, no training will be needed at all. Additionally, simpler models are easier to maintain. As you are building a model, think about the number of parameters needed for inference. Will that data be available every time your customer will want a prediction? How hard or easy is it to gather the data needed? The simpler the model, the fewer parameters it has. The fewer parameters needed, the easier it is to ensure all the required data is available. Think about how long it takes to train your model (assuming you aren’t using a physical model). How often will the model need to be retrained and how long will the training take? 
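The retraining questions above come down to simple arithmetic that can be brought to the conversation with the customer. The sketch below is a minimal illustration: the function name `annual_training_hours`, the candidate model labels, and all the numbers are hypothetical, and it deliberately sets no pass/fail threshold, since what is acceptable depends on the project.

```python
def annual_training_hours(hours_per_training, retrains_per_year):
    """Total time spent retraining a model over one year."""
    return hours_per_training * retrains_per_year


# Hypothetical candidates; a physical model needs no training at all.
candidates = {
    "deep learning model": {"hours_per_training": 12.0, "retrains_per_year": 12},
    "simple machine learning model": {"hours_per_training": 0.5, "retrains_per_year": 12},
    "physical model": {"hours_per_training": 0.0, "retrains_per_year": 0},
}

for name, cost in candidates.items():
    hours = annual_training_hours(**cost)
    print(f"{name}: {hours:.1f} training hours per year")
```

Laying the options side by side like this makes the trade-off concrete before a model type is chosen.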
If a model takes many hours to build—say a deep learning model—but you only need to retrain infrequently, that might work. If it takes many hours to train the model, and you’ll need to retrain monthly, that may not work. These are some of the questions you’ll need to ask to gather this information. How often do the input parameters change? How often does the process that the model supports change? Understanding these parameters and having the conversation with the customer of the model about training frequency is important to ensure your project gets to deployment. The fifth risk for your project is that you solve the wrong problem. Of the five reasons, solving the wrong problem is the most devastating for a project team. After working long and hard on a project, getting to the end, only to learn that you haven’t delivered a helpful solution is difficult. It is also not that uncommon.
A Principal Engineer where I work did a study of problem-solving task forces looking at “why we were slow to get to root cause.” The number one reason was that the teams working to solve the problem jumped to a conclusion with insufficient data. The number two reason was that the teams solved the wrong problem. The key to avoiding the pitfall of solving the wrong problem is to go slow. That’s a challenging piece of advice if you are addressing why a team was slow to get to root cause in the first place, but hear me out. When you solve the wrong problem, once you realize it, you still need to continue to work to solve the actual problem. Most likely what has happened is that you’ve addressed a symptom or superficial need, but not addressed the root cause or fundamental want of the customer. So, you end up solving the problem at least twice. Solving the wrong problem can be avoided by having conversations and asking questions up front to make sure you have alignment with your customer. Take time at the beginning of a project to understand the problem and what the customer really wants. I cover problem statements more deeply in Chapter 4. Albert Einstein said, “If I had an hour to solve a problem, I’d spend 55 minutes thinking about the problem and 5 minutes thinking about solutions.” I don’t intend this quote to be a recommendation on the ratio of thinking about a problem versus thinking about solutions. I do want to caution you about jumping into a project without first spending time thinking about the problem. It may seem like you are going too slowly by taking this up front time. However, if you can’t afford the time up front, you certainly can’t afford the time to do it twice. This is called going slow to go fast. I learned about going slow to go fast by participating in autocross. Autocross is a timed race with only one car on the track at a time. 
Usually, autocross events take place in empty parking lots with the course marked by chalk lines and traffic cones. Your score is determined by the time it takes for you to navigate the course, and a penalty is given for any cones you hit, and for going off course. Autocross is more about handling than about horsepower, so there are frequently tight turns, and almost always a slalom. Hence, the going slow to go fast. If you try to take a curve too quickly, you will likely take out a few cones and increase your time by quite a lot. If you go slower through the curve, you can make it without hitting any cones and achieve a competitive time. Applying the concept of going slow to go fast to data science projects means thinking about, and planning for, deployment at the start of your project. As data scientists, it is tempting to jump into the model building phase of a project as quickly as possible. You know, just do the minimal data cleaning needed and get right to modeling. The risk of this approach is that it sets your project up for all the pitfalls and makes it likely to be one of the 87% that don’t make it to deployment. Instead,
deliberately proceed through the project phases: Define project objectives; Acquire and explore data; Model data; Interpret and communicate; and Implement, document, and maintain. Check in with your customer throughout the project to make sure you have alignment, and keep management updated on your progress and what has been delivered so far. Keep things simple and iterate. This ensures you have an achievable scope to your project. Iterating on the project means you can deliver value as you go without having the full scope of the project finished.
2.1 TIPS FOR MANAGERS
Two of the project pitfalls are related to scope. Scoping problems are easy to fall into. Help your team manage the scope of a project properly. To do this, ask questions at the start of a project to aid in scoping it correctly.
• Who is this for?
• What is the minimum that they need?
• What needs to be done to deliver that minimum?
Keep reminding your team that the customer gets a benefit from delivery of a project into production, even if it is not perfect. Help them move a project to deployment before beginning improvements so your organization can start to receive the business value while your team works on the next version. If you are familiar with Agile development and the Agile Manifesto, this is applying the first principle: “Our highest priority is to satisfy the customer through early and continuous delivery of valuable software” (Beck, 2001). Applying this principle helps prevent scope creep and makes sure you are seeing business value as quickly as possible. Coach your team to assess scope creep by asking the following questions.
• Is the report/model currently working?
• What is the minimum work needed to get it working?
• Are we maximizing value delivered for the time we are spending on the project?
• Can we get value from it as is?
• Can we put this in production? Why not?
Asking why can help you gain understanding from your data science team and coach them to think through whether they are over processing, meaning putting in
more effort than is actually required. Lean Six Sigma suggests asking why five times. In practice, I have found it best to ask why until you get to a fundamental constraint like, “that’s how physics works.” Another two of the pitfalls relate to models. Make sure your team discusses with you the type of model they intend to use. Help them evaluate the customer’s and the business’ need for explainability and select an appropriate model that meets those requirements. Ask questions to gain understanding about the model to be used, so that you are comfortable with the model and can help explain to others the method your team is using. Help your team avoid building models that are not tied to reality and avoid over processing in model building. It can be tempting for data scientists to work on improving model accuracy indefinitely. Help your team ensure that they are not chasing diminishing returns. The last pitfall relates to solving the wrong problem. Two things will assist in avoiding this pitfall. (1) Work with your team to ask questions at the start of a project. (2) Support your team in getting alignment with customers and stakeholders. Guide your team to go slow to go fast. Make planning both what will be done and how it will be done a normal part of working on a project. Ask questions about how the team is planning the project and the work. Ask from the very start how they plan to deploy the project. Make sure they are thinking about the end state as they build things. Allow them to iterate rather than wait for perfection to deploy a solution.
CHAPTER 3
Five Methods to Avoid Common Pitfalls
So, what should we do up front to make sure we have fully defined the problem and will not fall into one of the common pitfalls? There are five methods: Ask questions; get alignment; keep it simple; leverage explainability; and have the conversation. Each of the five methods addresses multiple pitfalls, as shown in Table 3.1.
Table 3.1: Connection between the methods to avoid pitfalls and the five project pitfalls
Method: Avoids These Pitfalls
Ask Questions: Scope is too big; Scope creep; Model couldn’t be explained; Solved the wrong problem
Get Alignment: Scope is too big; Scope creep; Solved the wrong problem
Keep it Simple: Model couldn’t be explained; Model was too complex
Leverage Explainability: Model couldn’t be explained; Solved the wrong problem
Have the Conversation: Scope is too big; Scope creep; Model couldn’t be explained; Model was too complex; Solved the wrong problem
3.1 ASK QUESTIONS
You need to start a project by asking questions. For example, what is the desired outcome? What is your customer looking for? What’s the goal of the project? Once you understand the answers to these questions, you can build on them to ask additional questions and gain clarity. Asking questions isn’t limited to just the start of the project. Starting off by asking questions gets you on the path to delivering a successful solution to production. Continuing to ask questions as you work on the project keeps the project on that path. Asking questions up front prevents having too big a scope and can prevent scope creep. Asking questions about expectations can help ensure the model won’t be too complex and will ensure you use a model that your customer is comfortable with. Asking questions helps guarantee you will work on the correct problem. In Chapter 4, I’ll share two tools that will help you ask these questions in the first phase of your project. These tools can be used to check in with your customer, management, and key decision maker as your project progresses.
3.2 GET ALIGNMENT
In Chapter 4, I’ll share a tool that helps get alignment with your customer and with your management before you start the project. Getting alignment with your customer up front means that you will have agreement on what the project entails. Together you’ll decide on the scope of the project, and what “done” looks like. I can’t emphasize enough the benefit of having a clear definition of “done” that the customer has agreed to. This alignment reduces the risk of scope creep and of solving the wrong problem. Having alignment with your management makes sure they understand what benefits and deliverables will come from the project. It means that you have prepared them for future requests for support or resources. In Chapter 5, I provide examples of business value delivered by different types of data science projects, and a table of conversions to dollars for common metrics.
3.3 KEEP IT SIMPLE
The next method to avoid project pitfalls is to keep it simple. This is something to keep in mind through all the phases of the project. It is true for both model building, and for the overall project. Added complexity can be a form of scope creep and can prevent your project from being deployed to production due to bugs.
If you can get a simple version of the project deployed, you can always go back and refine the model or add features. You get a benefit and deliver business value from having the project deployed. Added complexity delays that benefit. In a model, complexity can make the model difficult to maintain. Increased complexity often comes with increased time required to train a model. I’ll discuss this in more detail in Chapter 7. Keep deployment of the project simple. If there is an existing system, use it to deploy your project. If there are existing business processes, make sure your project works within them, unless the intent of the project is to change them. I’ll cover this topic in more detail in Chapter 9.
3.4 LEVERAGE EXPLAINABILITY
Part of the reason for asking questions and getting alignment is to understand your customer’s needs and wants. Then, you can ensure the model you use is in alignment, especially to your customer’s need for explainability. This prevents the customer having second thoughts or last-minute concerns which risk deployment of a solution. The same goes for management. Even if your end customer is ok with a “black box” type solution, your management may not be comfortable providing it. This is less of a concern if you are working for internal customers, but if your end customer is the public, your management may want explainability. Or they may not. Ask them and find out before you build things. Take advantage of the industry interest in explainability. This is a hot topic in the field. There is a lot of research in this area, and I anticipate there will be new methods and solutions in the future (Royal Society, 2019).
3.5 HAVE THE CONVERSATION
Throughout the project, from the very beginning, it is important to have an ongoing conversation with the customer. This will help prevent all five project pitfalls. You’ll be able to set the scope of the project properly and avoid scope creep by checking in with the customer as you go to share the status of the project and what can be done at increments. You can make sure you are keeping the project at the right level of simplicity, so it does what is needed without excess. Your customer will have working knowledge of the model and familiarity with how the solution works through repeated exposure during your conversations. Finally, you can be sure that you’ve solved the correct problem because you’ve checked in with the customer as you went.
Feature requests and other suggested improvements are opportunities to have further conversations about the project. As part of these conversations, make sure you are asking about timelines as well as the desired outcome from the change. All too often we hear customer requests as “must dos” but frequently the customer doesn’t really know exactly what they want, or even what is actually possible. A portion of the responsibility in delivering a finished project is to help the customer articulate what it is that they actually need. The best way to get this to happen is to have a conversation with the customer. Ask clarifying questions and use a whiteboard or paper to create mock-ups so they can see what to expect and correct you if you aren’t fully understanding what they are trying to get. At the end of a project, take time to reflect back and collect your thoughts on what you learned. Review the project charter and compare what you expected would happen to the actual outcomes. It is helpful to have a final review with your customer to share the results of the project, including delivered business value. At that review, you can get your customer’s feedback on the project to incorporate into your reflection. Cover both what you would want to change in future projects and what went well on this project. This helps you learn and grow as a data scientist.
3.6 TIPS FOR MANAGERS
Support your team in asking questions at the start of a project and throughout the process. Facilitate meetings between stakeholders and your team or represent your team in meetings with stakeholders. Facilitate conversations between your team and the customer of the final output of the project. Make sure your team is checking in with the end customer regularly and involving them in decisions—like how the final report will look, what graphs will be used, and how the customer will interact with the model when deployed. Ensure that there is a strong connection to customers and at the same time that the team is protected from excessive new feature and change requests. The best way to ensure a strong customer connection while protecting your team is to establish structured meeting times to share the latest iteration with the customer and to collect feedback. Help your team in receiving feedback. Sometimes a critique can be challenging to hear, and if your team becomes defensive, then they lose the opportunity to learn and improve. Communicate with the customer and stakeholders that your team needs uninterrupted time to work and that requests for changes and new features will only be accepted in the established review and feedback meetings. If you are using Agile
Development, have the team’s Scrum Master help with protecting the team from distractions, and require that your team only accept user stories from outside the team if they come via the Product Owner. Check in with your team: are they keeping it simple? Ask them about what the cheapest and quickest solution would be. Coach them to choose simple methods. Make it OK to put a solution into production and then go back and improve it in the future. Train your team to learn from past projects by performing retrospectives. A retrospective is a time when the team reflects on a project that has been deployed and compares the project charter to actual results. Ask what was supposed to happen and what did happen during the project. Ask what they have learned, and what they would like to change or continue in the future. Keep it positive. This is about learning, not recriminations.
Table 3.2: Questions to ask at retrospectives
What was supposed to happen?
What did happen?
Was there a difference? Why or why not?
What did we learn?
What do we want to change or continue?
The first time a team does a retrospective, they will be nervous that you will be looking to find fault. Finding fault is counterproductive and will not help the team improve and grow. Make it safe to discuss failure and examine what happened without placing blame on individuals. Keep in mind that your team is doing their best. No one comes to work wanting to mess up. To quote W. Edwards Deming, “A bad system will beat a good person every time” (Deming, 1993). Support your team to think about where your organization’s systems and business processes are holding them back and take steps to make improvements. Start with small things your team controls. If you only change one thing after that first retrospective as a result of their feedback, they will feel empowered and engaged in the process, and future retrospectives will be very fruitful.
CHAPTER 4
Define Phase
The first phase of a project is where you set the scope and determine the deliverables, that is, what the outcomes of the project will be. As discussed previously, it is important to do this up-front work because it prevents future problems that can cause your project to fail. Work done in this step ensures you have alignment with both management and the end customer, and that the scope of the project is defined. This protects against scope creep and having too big a scope. The questions you ask and the conversations you have during the define phase of the project protect against solving the wrong problem. There are two tools from Lean Six Sigma you can use to help ensure you ask the questions, get alignment, and have the needed conversations at the beginning of the project. These tools are: (1) a project charter and (2) a Supplier-Input-Process-Output-Customer (SIPOC) analysis.
4.1 PROJECT CHARTER
The project charter is a living document that establishes the parameters around the problem you are working on. It helps you identify who the stakeholders and final decision makers are. Writing a project charter means that you get alignment on the following questions: What’s the problem you are solving? How will you know you solved it? What is in and out of scope? What is the business benefit that will be delivered? Who is involved in the project? Who is the decision maker? It might be tempting to overlook this step or do a quick pass only. The benefit of doing this work at the start of the project is that it forces you to have the conversations and get alignment on boundary conditions, customers and stakeholders, and expected outcomes. This will allow you to make decisions quickly as the project progresses, because you already have that information and don’t need to seek alignment during the later phases of the project. The reason the charter is a living document is that often there are things you learn as you progress in the project that may require a change to the problem statement or modification of the project scope. Having these written down means you notice when the scope has changed, which triggers a conversation with the customer and
decision maker. This keeps the alignment between the project as delivered and the customer’s expectations, which ensures your project will make it to production. The format of the charter document isn’t as important as its content. The key components of the charter are problem statement, scope, how to measure success of the project, stakeholders, and decision maker (Table 4.1).
Table 4.1: Key components of a project charter
1. Problem Statement: What is the problem that needs to be solved? Why is it important to solve this problem?
2. Scope: What is included, what is excluded? What are the boundary conditions?
3. Metrics: How will we measure the project? How will we know we are done?
4. Stakeholders: Who will provide input to the project? Who will work on it? Who is our end customer?
5. Decision maker: Who will be the final decision maker? What type of decisions are they responsible for?
The project charter starts with the problem statement. Why are you undertaking this project in the first place? What is the problem that needs to be solved? The problem statement should contain the basic facts of the problem. Include why the issue matters. This ensures there is alignment for why we are doing the project. In writing the problem statement ensure you are keeping to just the facts of the problem and not sneaking in possible solutions. The problem is not that we don’t have a report for monthly sales figures. The problem is that we want to know if our new marketing campaign is working. To do that, maybe looking at monthly sales figures is the correct approach. Maybe there is another metric that would be as good or better. If you assume the answer in setting up the problem, you limit your thinking. To make sure you aren’t unintentionally limiting your thinking by including solutions in your problem statement, you can use a standard format for problem statements. A good problem statement answers: who has the problem, what the problem is, where it occurs, when it occurs, and what the impact of the problem is to the business. You can use a fill-in-the-blanks style format like this: During (period of time for baseline), the (primary business measure) for (business process) was (baseline). This is a gap of (objective target vs. baseline) from (business objective) and represents (cost impact of the gap) of cost impact.
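The fill-in-the-blanks format can be sketched as a small data structure, which makes it harder to leave a blank empty or to slip a solution into the statement. This is only an illustrative sketch: the class and field names are assumptions made for the example, and the cost impact is left optional because it is not always known when the first draft is written.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProblemStatement:
    """Blanks of the fill-in-the-blanks problem statement (illustrative names)."""
    baseline_period: str        # period of time for baseline
    business_measure: str       # primary business measure
    business_process: str       # business process
    baseline: float             # measured baseline, in dollars
    target: float               # objective target, in dollars
    business_objective: str     # the objective the target comes from
    cost_impact: Optional[str] = None  # cost impact of the gap, if known

    @property
    def gap(self) -> float:
        """Objective target minus baseline."""
        return self.target - self.baseline

    def render(self) -> str:
        """Fill the blanks and return the statement as one sentence pair."""
        text = (
            f"During {self.baseline_period}, the {self.business_measure} "
            f"for {self.business_process} was ${self.baseline:,.0f}. "
            f"This is a gap of ${self.gap:,.0f} from {self.business_objective}"
        )
        if self.cost_impact:
            text += f" and represents {self.cost_impact} of cost impact"
        return text + "."


# Widget figures from this chapter's marketing campaign example.
statement = ProblemStatement(
    baseline_period="the new marketing campaign",
    business_measure="sales",
    business_process="widgets",
    baseline=50_000,
    target=75_000,
    business_objective="our quarterly sales goal",
)
print(statement.render())
```

Keeping the gap as a computed value rather than a typed-in number means the statement stays consistent when the baseline or target is revised during conversations with the customer.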
Writing up my example of the new marketing campaign in this format, my customer and I can start with: “During the new marketing campaign, the sales of widgets was $50,000. This is a gap of $25,000 from our quarterly sales goal.” We can see, though, that what we really will want to know is how the sales compared with the new marketing campaign to sales without the campaign. We need to adjust the metric we use accordingly. We rewrite the statement to say: “During the new marketing campaign the sales of widgets was $50,000, $5,000 more than the period of the same duration prior to the campaign.” Now we have a clearer picture of how we need to format the report for the customer to answer the questions they have. We know we need to be able to compare periods with marketing campaigns to similar periods with baseline sales to see the impact of the campaign. We may still want to include the gap to the overall sales goal, or the customer may decide that isn’t important to the problem they are looking to solve. This reinforces the need to ask questions and get alignment with the customer of a project before you start building things. Taking time now to align on the problem statement prevents solving the wrong problem. Sometimes filling in the problem statement helps highlight that perhaps the problem is we don’t have defined metrics. You can then work with your customer to define what those metrics would be and set up the project to deliver those metrics in a dashboard. The scope of the project is another important item contained in the project charter—what is included and excluded? What are the boundary conditions you need to work within? Is there a budget for datasets from external vendors? Does all data need to be collected from internal sources? Is your report for a specific team or does it have multiple users? What other resource or technical constraints do you need to work within? 
Even if, and I should say, especially if, these are the “everyone knows that” type of boundaries, make sure you include them in your project charter. The reason is that you can then question whether they should be boundaries for your project and have the conversation with management about what would be possible if that constraint was removed. This sets up the ability to do a return on investment calculation and potentially relax some constraints. Setting up the scope helps you to assess, before you start the project, if you will have the resources you need to be able to deliver the desired outcomes. If things don’t line up, having the conversation before work on the project starts means you’ll be able to adjust either the constraints or the expected outcomes to match what you can get done. The charter should also include documentation on how you will know the project is successful. What are the measured criteria that allow you to know that you
have addressed the problem sufficiently? Are there customer requirements for model accuracy? Are there customer requirements for timeliness? Have those requirements been addressed satisfactorily by the project? These measurements help you know when you are done and are a layer of protection against scope creep. The expected return on investment from the project should be documented as well. That helps with prioritizing work, and with ensuring management understands the benefit of resourcing the project. Lastly, the charter should include who will provide input to the project—the stakeholders, and who will be the final decision maker. For each decision, there should be only one person who decides. To go fast, it is helpful to have established in advance who that will be.
Project Charter
Project Name: New Marketing Campaign Dashboard
Project Owner: J. Weiner
Problem Statement: “During the new marketing campaign the sales of widgets was $50,000, this was $5,000 more than the period of the same duration prior to the campaign.”
Scope: Design and deploy a report that shows the impact of marketing campaigns on widget sales
In Scope: Widgets, marketing campaigns for widgets, report presented as an interactive dashboard, design of visualizations to present data
Out of Scope: Non-widget products, other marketing campaigns, predictive models
Metrics:
Item                                      Current Value   Goal         Note
Baseline sales of widgets                 $45,000.00      $75,000.00   Q3 2018 data
Widget sales during marketing campaign    $50,000.00      $75,000.00   Q3 2019 data
Stakeholders: Widget marketing team, J. Smith (Widget marketing team manager), Data science team, D. Jones (Data science team manager), Widget sales department
Decision Maker: J. Smith
Figure 4.1: Example project charter
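Because the charter is a living document, it can help to keep it in a form that is easy to check and update. The sketch below, with class and method names that are assumptions made for illustration, holds the example charter from Figure 4.1, reports which of the five key components are still empty, and derives the campaign lift and the gap to goal from the metrics rows.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Metric:
    item: str
    current_value: float
    goal: float
    note: str = ""

    def gap_to_goal(self) -> float:
        """Distance still to cover between the current value and the goal."""
        return self.goal - self.current_value


@dataclass
class ProjectCharter:
    name: str
    owner: str
    problem_statement: str
    in_scope: List[str]
    out_of_scope: List[str]
    metrics: List[Metric]
    stakeholders: List[str]
    decision_maker: str

    def missing_components(self) -> List[str]:
        """List the key charter components that have not been filled in yet."""
        checks = {
            "problem statement": self.problem_statement.strip(),
            "scope": self.in_scope and self.out_of_scope,
            "metrics": self.metrics,
            "stakeholders": self.stakeholders,
            "decision maker": self.decision_maker.strip(),
        }
        return [name for name, value in checks.items() if not value]


baseline = Metric("Baseline sales of widgets", 45_000, 75_000, "Q3 2018 data")
campaign = Metric("Widget sales during marketing campaign", 50_000, 75_000, "Q3 2019 data")

charter = ProjectCharter(
    name="New Marketing Campaign Dashboard",
    owner="J. Weiner",
    problem_statement="During the new marketing campaign the sales of widgets "
                      "was $50,000, $5,000 more than the prior period.",
    in_scope=["Widgets", "Marketing campaigns for widgets", "Interactive dashboard"],
    out_of_scope=["Non-widget products", "Other marketing campaigns", "Predictive models"],
    metrics=[baseline, campaign],
    stakeholders=["Widget marketing team", "Data science team", "Widget sales department"],
    decision_maker="J. Smith",
)

# Lift attributable to the campaign period versus the baseline period.
campaign_lift = campaign.current_value - baseline.current_value
print(charter.missing_components(), campaign_lift, campaign.gap_to_goal())
```

An empty list from `missing_components` only says that every component has an entry; whether the entries reflect real alignment still depends on the conversations with the customer and decision maker.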
Having the stakeholders and decision maker written down in the charter document does two things. One, at the beginning of the project it makes you think through who the project stakeholders are so you can get their input and feedback at the start of the project. They may be customers of the end result of the project, or they may be suppliers of data or other input. Writing down who they are means you have a list and can check in with them as the project progresses. Writing down the stakeholders is also a check on project scope. If you start to have a large number of stakeholders, then it might be good to scope the project down or break it into pieces that can be delivered separately to different sub-sets of stakeholders. The other thing that writing down the stakeholders and decision maker does is that it helps with alignment. Stakeholders can see who else will be giving input into the project, and they can see who the final decision maker will be. This prevents all your stakeholders from assuming, naturally, that they are the key decision maker for the project. It avoids the problem of too many bosses which will make finishing a project difficult due to expansion of scope and misalignment.
4.2 SUPPLIER-INPUT-PROCESS-OUTPUT-CUSTOMER (SIPOC) ANALYSIS
A SIPOC (pronounced “sigh-pock”) summarizes the inputs and outputs of a process in a table format; see Figure 4.2. More importantly, it includes the requirements for those inputs and outputs. For a data science project, this helps ensure that the model selected provides the expected accuracy and establishes what data are needed to generate that model. For a project delivering a report or dashboard, it helps determine the criteria for frequency of use, who the audience will be, and what data are needed to create the report or dashboard. A SIPOC is created in three parts. The first part establishes the start and end of the process that is being worked on. This could be to fix a problem or to enhance a process with automation or AI. For example, if you are intending to build a report to help make a decision, the process that is used to make the decision would be in the center of the SIPOC. The second part focuses on the output of the process and the customer, including the customer requirements. The third part looks at the inputs needed for the process to deliver the outputs, and what the requirements are for those inputs.
Figure 4.2: Supplier-input-process-output-customer (SIPOC) analysis table. The SIPOC is completed in three parts following the numbered steps.
Let’s use the following as an example and build the SIPOC. Say I am planning to build a report to help engineers determine if material meets certain criteria for quality and can continue to be processed, or if it should be scrapped. That decision-making process would go in the center of the SIPOC table. I would write down the start and end points: start—material is on hold for engineering decision; end—material is dispositioned, as shown in Figure 4.3.
Figure 4.3: Example SIPOC for engineering dispositioning material.
The next phase is to examine the outputs. For this example, I write down the outputs or deliverables from the process: dispositioned material. But let’s think that
through. Really, the output of the process is quality parts, so let’s make that change. The other output of the process is timely decision making. Not only do we want quality parts, we want to have our material flow through the factory and not be waiting for a long time. Next, I identify the customers who receive deliverables from the process: manufacturing, engineering, and engineering management. Manufacturing is a customer because they want to keep material flowing through the factory, and also want to make sure the manufacturing process is making quality parts. Engineering is a customer because the reason that material needs to be dispositioned is useful for troubleshooting the manufacturing process and making adjustments to ensure the process produces quality parts. Engineering management is a customer because they are responsible for making sure their team dispositions material quickly and correctly and fully addresses problems so that they don’t recur. In thinking about the customers and why they are customers, we start to see what their requirements might be. The best way to know for certain is to go ask the customer. You can start this conversation by sharing what you think their requirements might be, and then listen to their feedback and additions. Finally, I write down the requirements for each output from each customer; see Figure 4.4.
Figure 4.4: Example SIPOC with both Part 1, Process, and Part 2, Output and Customer completed.
Think about requirements in terms of accuracy, timeliness, and completeness. If there are ways to measure these qualities for the output, include those metrics. In this example, the manufacturing line wants quick response time from the overall process, so material waits a minimal amount of time for engineering to decide. They also want the
correct choice, so that material is not wasted and the final customer is not upset by receiving material of poor quality. Engineering wants the report to have all the necessary information in one place so they can quickly make a decision. They also want the report to be accurate so they can disposition the material correctly. Knowing the outputs then means I have a sense of what inputs are required. The third part of completing a SIPOC analysis is to look at the inputs to the process. This starts with writing down what inputs are required to enable the process to occur. Then you look at who or what supplies each of those inputs. Finally, you document the requirements for each input from each supplier. In our example, the engineer needs to know what material is waiting for them to disposition, why the material has been flagged for them to look at, and what happened on the equipment when the material was being processed. The list of material is necessary for timely decision making. Why the material was flagged and what happened on the equipment provide the engineer with information so they can make corrections to the process and prevent similar errors in the future. There is a clear connection between the outputs that meet the customer’s needs and the inputs required to build the report.
Figure 4.5: Example SIPOC with Part 3 Suppliers, Inputs started.
Once we know the needed inputs, we can identify who (or what system) supplies that information. Then we can determine the requirements for each of the inputs so that our report can meet the needs of our customers. In our example, as we complete the suppliers and requirements, we notice we need one other input. We need to know how long the material has been waiting to be able to meet the manufacturing requirement of not waiting longer than 12 hours. The data need to be accurate and include detail on what was measured compared to
the goal for that parameter. An example of this is statistical process control limits. This information is needed by the engineer to be able to answer why the material was flagged. Suppliers can be teams or systems. In this example, manufacturing is a supplier—if they have entered comments into the shop floor control system. Other suppliers are the databases for the statistical process control and shop floor control systems; see Figure 4.6.
Figure 4.6: Example SIPOC with all parts completed.
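The 12-hour waiting requirement identified in the SIPOC can be turned into a simple check once the waiting-time input is available. The following is a minimal sketch, not the report described in the text: the lot IDs, timestamps, and function names are hypothetical, and it assumes each flagged lot comes with the time at which it was flagged.

```python
from datetime import datetime, timedelta

# Manufacturing requirement from the example: material should not wait
# longer than 12 hours for disposition.
MAX_WAIT = timedelta(hours=12)


def waiting_time(flagged_at: datetime, now: datetime) -> timedelta:
    """How long a lot has been waiting for disposition."""
    return now - flagged_at


def overdue_lots(flagged: dict, now: datetime) -> list:
    """Return IDs of lots waiting longer than MAX_WAIT, longest wait first."""
    late = [
        (waiting_time(flagged_at, now), lot_id)
        for lot_id, flagged_at in flagged.items()
        if waiting_time(flagged_at, now) > MAX_WAIT
    ]
    return [lot_id for _, lot_id in sorted(late, reverse=True)]


# Hypothetical lots and timestamps, for illustration only.
now = datetime(2021, 1, 15, 18, 0)
flagged = {
    "LOT-A": datetime(2021, 1, 15, 9, 30),  # waiting 8.5 hours
    "LOT-B": datetime(2021, 1, 15, 4, 0),   # waiting 14 hours
    "LOT-C": datetime(2021, 1, 14, 22, 0),  # waiting 20 hours
}
print(overdue_lots(flagged, now))  # ['LOT-C', 'LOT-B']
```

Sorting by waiting time puts the material closest to upsetting the manufacturing customer at the top of the list.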
After completing the SIPOC, you might need to go back to your project charter and update the list of stakeholders based on the findings from listing out the customers and suppliers. You may also need to adjust the scope of the project based on what you’ve learned. The SIPOC is helpful in the phase of your data science project where you are acquiring and exploring data. From the SIPOC analysis you know the requirements for the inputs to your process, report, or model which will help you select data sources. From our example, we know we need to extract data for our report from the statistical process control database and from the shop floor control database because we have completed the SIPOC and understand the requirements to enable us to create the output the customer desires. As you do the data acquisition and exploration, show the results to your customer and get their feedback. The project charter and SIPOC help you ask the questions at the start of a project that set you up to succeed in the end and deliver a project to production. The project charter establishes the problem you will be working on, so you don’t solve the wrong problem, and sets the scope and definitions of success to prevent too big a scope and scope creep. The SIPOC allows you to think through stakeholders and
scope the project requirements as well as get alignment on the expected deliverables and requirements for those deliverables. This is useful for defining what “done” looks like for your project. The other factor in determining if you are done with a project is to compare the expected business value to the business value delivered. Achieving the expected business value early is a reason to re-assess the project scope and maybe stop working on it, after having a conversation with your stakeholders and final decision maker. Not meeting the expected business value after all the planned work is complete can be due to factors outside your control and is worth assessing if further work should be done, or if things are good enough as is. Again, this is a joint decision between the team working on the project and the final decision maker. In the next chapter, we’ll investigate how to calculate business value for data science projects.
4.3 TIPS FOR MANAGERS
Ask to see a project charter for each project your team works on. This will frame the work and provide metrics you can use to guide your team and assess the project status. Make sure you have alignment with your team on what “done” means for the project and hold them to it. Request a SIPOC analysis for projects and coach your team to complete it early in the define phase of the project. This analysis provides insight into the customer needs and requirements to enable the project to meet those needs. Having done this analysis, the team will know what data are required to build the report or model they will deliver to the customer. Have regular reviews with your team and make sure they are updating the project charter and the SIPOC as they learn about the problem the project helps to solve. Check in also to ensure the project isn’t suffering from scope creep. If you meet the expected business deliverables early, allow the team to be done with the project. Wrap up all projects with a retrospective where you compare the expected business value to the delivered value. Coach your team to investigate both cases of when you do not meet the expected value, and when you exceed the expected value. This will help them learn and refine their calculations and hone their ability to make more accurate estimates.
CHAPTER 5
Making the Business Case: Assigning Value to Your Project
Part of the project charter is to document the expected return on investment for the project. Assessing the business value for your project will help get resources and funding. It helps to answer management’s question of “what do I get?” Knowing the expected deliverables for a project helps in getting support. It also helps with defining when you are done with a project. It can sometimes be difficult to assign a business value to a data science project. There is enough benefit to be derived from the effort to calculate the value that performing the exercise is worthwhile despite the difficulty. To help, I’ve identified business benefits by types of data science projects and created a table of conversions to dollars for common metrics.
Table 5.1: Deliverables and metrics for various types of data science projects
Project Type | Deliverables | Metrics
Data analysis | Root cause determination; Problem solving support; Problem identification | Productivity; Time to decision; Decision quality; Risk reduction
Automation | Time savings; Decision support | Time to decision; Waste eliminated; Decision quality
Standards and business processes | Standardization; Business process improvement | Excursion prevention; Quality improvement; Risk reduction
Data mining | New insights; Learned something new | Improved model accuracy; Decision quality; Risk reduction
Improved data science | Increased capability; Advanced algorithms | Productivity; Decision quality; Risk reduction
To build out these metrics, let’s look at the types of data science projects I’ve been involved in over the course of my career. I can group the projects I’ve done into five broad categories. I’ve done data analysis projects. I’ve built reports and automated processes. I’ve done projects to devise standards and improve business processes. I’ve delivered insights from data mining. I’ve done projects which improved my organization’s ability to do data science. Each type of project has different deliverables, and different metrics to measure those deliverables (Table 5.1).
5.1 DATA ANALYSIS PROJECTS
Productivity and time to decision are similar metrics, but slightly different. Productivity measures how much effort is required to determine root cause or identify problems. If your project makes this easier, you can measure the impact by assessing the productivity improvement and looking at the value of other work people can now do since they have time freed up from the results of your project. Time to decision is a measure of how long it takes to gather the information needed to make a decision.
Here’s how a time to decision calculation works. Say it used to take someone two hours to pull together the information required when a particular decision needed to be made. This information needed to be presented to a meeting and the decision was made in that meeting. Say you were able to create a script to extract the data needed and build a visual like a graph rather than a table of numbers, which let the meeting attendees clearly see the information they needed to support the decision. It could be possible that you were able to reduce the time needed from 2 hours plus the time in the meeting (let’s say it would take 20 minutes in the meeting) to 10 minutes total. That’s an improvement in time to decision of 93%: (140 - 10) / 140 minutes.
Now, to convert to dollars, we take the average salary of the participants (the person who usually collected the information and the people making the decision) and multiply that by the hours saved. I’m using example values here, so for ease let’s say the average salary is $100,000 annually. At roughly 2,000 working hours per year, that works out to $50 per hour. The hours saved are (2 + 0.3333) - 0.1667 = 2.1667 hours. Since in this case only one person did the work to gather the data, we multiply by one. You save that amount every time this decision needs to be made, so if this is a decision that is made quarterly, the project delivers 2.1667 × 4 × 1 × $50 = $433.34 annually (Table 5.2). That’s just from saving about two hours of time for one person. If there is more than one person doing the task and the task is performed daily, the numbers can really add up.
Table 5.2: Example calculation for time saved
Time Saved Per Task (hours) | # Times Task Performed Per Year | # of People Who Perform the Task | Average Hourly Salary | Total Value Delivered
2.1667 | 4 (quarterly) | 1 | $50 | $433.34
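The time to decision calculation can be written as a short script so it is easy to rerun with different values. This is a minimal sketch using the example numbers from the text; the function and variable names are my own. Note that it works with unrounded hours, so it prints $433.33, one cent below the table's figure, which was computed from the rounded 2.1667 hours.

```python
def hours_saved(old_minutes: float, new_minutes: float) -> float:
    """Time saved per task, converted from minutes to hours."""
    return (old_minutes - new_minutes) / 60.0


def annual_value(hours_per_task: float, times_per_year: int,
                 people: int, hourly_salary: float) -> float:
    """Dollar value per year: hours saved x frequency x people x hourly rate."""
    return hours_per_task * times_per_year * people * hourly_salary


old_minutes = 2 * 60 + 20   # 2 hours of preparation plus 20 minutes in the meeting
new_minutes = 10            # total time after the project

saved = hours_saved(old_minutes, new_minutes)
improvement = (old_minutes - new_minutes) / old_minutes
value = annual_value(saved, times_per_year=4, people=1, hourly_salary=50.0)

print(f"{saved:.4f} hours saved per task")   # 2.1667 hours saved per task
print(f"{improvement:.0%} improvement")      # 93% improvement
print(f"${value:.2f} per year")              # $433.33 per year
```

Changing `times_per_year` to a daily frequency or raising `people` shows how quickly the value grows.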
Let me take a little time here to talk about decision quality. The quality of a decision is not defined by whether the outcome of the decision is good or bad. We could make a good decision and still have a bad outcome. Say we are deciding to plant a crop. We could make a quality decision and still have a bad outcome if the weather changed from its forecast. A good quality decision has the following characteristics: it is framed appropriately; there are alternatives; data and information are used to decide; the value and trade-offs are clear; logical reasoning is used; and there is commitment to follow through and to take action based on the decision. We can use these metrics to gauge the decision quality. If our analysis is providing the data and information needed to decide, how can we assess how much we have improved the quality versus not having that data?
5.2 AUTOMATION PROJECTS
For projects where you are adding automation, or building reports, the metrics are time to decision, waste eliminated, and decision quality. I talked about time to decision and decision quality in the previous section. Let’s focus on the concept of waste. Automation projects are often developed to save time and effort on repetitive tasks. Frequently they result in getting rid of unnecessary tasks or removing the need to wait for information. In Lean manufacturing there are seven types of waste. These have analogs to office work, and you can use similar calculations to quantify the benefit from eliminating the waste. I’ve summarized the types of waste in Table 5.3.
Table 5.3: Types of waste with manufacturing and office examples
Waste | Manufacturing Examples | Office Examples
Transportation | Moving things | Moving information
Inventory | Work in progress, parts in storage | Unread emails, reports to be read, approvals to be processed
Motion | Reaching for a tool, walking to get a ladder | Walking to get a copy, retrieving a file from a drawer, searching for a file on a drive
Waiting | Waiting for a part or for a person | Waiting for information, waiting for an approval
Over production | Making more product than customers are willing to pay for | Making reports no one reads
Over processing | Doing more work to a product than customers are willing to pay for | Adding features to a report that don’t have a customer or a use case
Defects | Parts that fail quality criteria | Errors and omissions on forms
5.3 IMPROVING BUSINESS PROCESSES
Some projects I’ve worked on resulted in defining standards for how work was done and improving business processes. In those cases, the deliverables are standards and improved business processes and the metrics are quality improvements and preventing process excursions. Improved quality can come from defining standards, such as a standard report that everyone who does a particular task uses. It can come from automating tasks to reduce errors, or from adding error checks to data entry. Process excursions can be prevented through applying statistical process control. This is not for factory processes only. I’ve applied statistical process control to change approval throughput times to identify data entry problems. Lean Six Sigma projects fall into this category and frequently provide very large business value, often by using data science: data analysis, developing automated reports, and creating models. My green belt project delivered over one million dollars in savings and used advanced statistical techniques like measurement system analysis and design of experiments. My black belt project delivered just about two million dollars in savings and used hypothesis testing and survey design.
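To illustrate how statistical process control can flag an excursion in an office process, here is a minimal sketch of an individuals control chart. The text does not say which chart type was used, so the individuals chart is an assumption; the throughput times are made-up values. The constant 2.66 is the standard factor (3 divided by d2 = 1.128) for limits based on the average moving range of consecutive points.

```python
from statistics import mean


def individuals_limits(values):
    """Lower limit, center line, and upper limit for an individuals chart.

    Limits are center +/- 2.66 x average moving range, where the moving
    range is the absolute difference between consecutive observations.
    """
    if len(values) < 2:
        raise ValueError("need at least two observations")
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    center = mean(values)
    spread = 2.66 * mean(moving_ranges)
    return center - spread, center, center + spread


def excursions(values, lower, upper):
    """Return (index, value) pairs that fall outside the control limits."""
    return [(i, v) for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical approval throughput times in hours, for illustration only.
baseline = [4.0, 5.0, 4.5, 5.5, 4.0, 5.0, 4.5, 5.0]
lower, center, upper = individuals_limits(baseline)

new_points = [4.8, 5.2, 9.0, 4.4]
print(excursions(new_points, lower, upper))  # [(2, 9.0)]
```

A flagged point is a prompt to investigate, for example to look for a data entry problem, rather than proof that something went wrong.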
5.4 DATA MINING PROJECTS
For other projects, the key outcome is new insights. These can come in the form of improved model accuracy, increased decision quality, gaining knowledge from mining data, or just learning something new in general. Often insights from data mining result in new projects which deliver further value. Learning something new is generally hard to quantify but should be included as text in a final summary of your project.
5.5 IMPROVED DATA SCIENCE
Finally, one of the outcomes of a project can be improved data science. You can develop new algorithms, and you can increase the organization’s ability to do data science as a result of a project. I’ve done a number of projects that fall into this category, often as a means to an end. In one project, we needed to access data stored on equipment hard drives. Historically, collecting this data was completely manual, so it was not done frequently. As part of the project, I worked with my IT department to set up a shared file system where the data could be transferred automatically. This increased the organization’s ability to do data science by making the previously orphaned data easy to access and allowed it to be combined with other data sources to enable new insights.
5.6 METRICS TO DOLLAR CONVERSION
Converting these common metrics into dollar value enables comparisons between different types of projects and helps with prioritization if there are multiple things you could work on. Additionally, dollar value is easy to understand. Table 5.4 shows common metrics and the conversion factor to translate the metric to dollar value. Keep in mind that you frequently will want to annualize the value delivered. If you hadn’t done the project, people would have to go back to doing the task the old manual way. How often the task that has been improved is performed per year is a factor in how much value your project has delivered. One last point on assigning business value to a project. Data science projects often result in freeing up people’s time by automating tasks or using AI to accelerate work. Freeing up people’s time means that they can work on other things as a result of the project. That additional work that your customer has been able to handle should be monetized and included in your project’s results.
Table 5.4: Common metrics and dollar conversion
Metric | Measure In | Conversion Factor
Dollars saved | Dollars | One time: none. Recurring: $/year × # years
Revenue gained | Dollars | One time: none. Recurring: $/year × # years
Productivity | Time saved, typically hours saved per year | Value of action people can now do; decision making, influencing, etc.
Time to Market | Weeks pulled in | (Value of increased market segment share)[3] × (number of weeks accelerated to market)
Schedule acceleration | Weeks pulled in (for the limiter) | (Value of increased market segment share)[3] × (number of weeks accelerated to market)
Time to decision; Lean waste reduction | Time saved; Lean TIMWOOD waste eliminated[4] | Multiply by average salary of worker in role, calculate savings due to waste eliminated
Quality improvement | Yield improvement, errors reduced | Cost per unit, cost (loss) per error, time wasted per error per year × average salary
Excursion prevention | Units saved | Cost per unit
Decision quality | Frame, alternatives, information, values and trade-offs, logical reasoning, commitment to action | Often decision alternatives are valued in dollars
Risk reduction | Value of the risk metric | Dollars
[3] Increased market segment share can be difficult to quantify.
[4] See Table 5.3.
CHAPTER 6
Acquisition and Exploration of Data Phase
In Chapter 4, we introduced SIPOC analysis. One of the results of doing this analysis is that you get a picture of the inputs needed to build the final output of the project, whether that is a report, presentation, or model. In addition to listing the needed inputs, the SIPOC includes the requirements for those inputs in order to achieve the desired results for the outputs. This is useful in identifying data sources for the project. Not only do you know what data are required, you also have information on how often the data are needed, and what level of quality is desired. In the ideal case, the data you want to use are automatically collected and stored in a database which you can easily access. Unfortunately, this is not always the situation.
6.1 ACQUIRING DATA
In acquiring data, there are two cases: one, when the data are already available either in internal systems or from external sources, and two, when you don’t have the data. The first case is straightforward; you need to connect to the data source and then move on to the next step of exploring the data. The second case is more challenging, so I will spend time on it. When you don’t have the data that you need, you must determine if it is currently being collected. From time to time I have found that the data are not being collected. In some cases, there are existing systems to automatically collect and upload data. Including a parameter in these cases means working with the system owner to add that parameter to the automated data collection. As far as data gathering goes, that is fairly simple. In other cases, there is no system and no data being collected. In this situation, you need to develop a data collection system.
6.2 DEVELOPING DATA COLLECTION SYSTEMS
If you are developing a system to collect data, and the data have not been collected previously, start small and start simple. Is there an automated system that captures the data? Can you store the data from that automated system in one standard location? As an example, say we are testing systems in a lab. The collected data are kept in files
on the test system’s hard drive, then transferred to the engineer’s computer for analysis. By switching the storage location for the collected data to a network drive, we can more easily explore data across multiple test systems. It is a simple change, much more straightforward than setting up a relational database to store the data. Long term, we may want to move in the direction of storing the data in a database. Our simple change has opened up the potential for analysis of data from multiple systems that we can start using right away.
Another example of starting small and simple is when a team I worked with used a SharePoint[5] list to collect data. We needed to forecast hardware use and would do the forecasts twice a year. These forecasts were kept in various office document formats like presentations or spreadsheets. Our problem was that we weren’t able to use the historic data from past forecasts because they were not saved anywhere systematically. By developing a standard location for the forecasts on our SharePoint and designing the SharePoint list to match formats people had been previously using, we made it easy for the team to enter the data and build a history.
When you need to collect data from people, make it easy for them to enter the data. If you have a form or data collection tool that has a lot of fields, people will tend not to fill things in, or not fill fields in completely, if they feel it takes too long or the form is too big. When you make the fields required fields, typically people will do the minimum amount required, even if they may know more about the situation and could add information by filling in other fields. When you are collecting data from people, think about the experience from their perspective. People are busy. People typically will do the minimum unless they are passionate about something. For example, I’ve had situations where a manufacturing technician would be very frustrated by an ongoing problem and add a ton of helpful content into a comment field, because they were angry that the problem kept happening. In that case, our normal data collection systems didn’t transfer the complete information well from the techs to the engineers.
Sometimes the people who enter the data don’t understand the value that can be gained from using the data they provide, so they do the minimum, or do it quickly and maybe with less attention to detail than you would like. Help them help you by minimizing the burden to enter data. Help them understand the value they deliver in collecting the data and entering it. Circle back to the people who provide the data with results from your analyses that have been made possible by their data and share what can be learned. Because of these difficulties with manual data collection, I advocate for automating data collection wherever possible.
[5] Other marks are the properties of their respective owners.
Starting small and starting simple is helpful also because you have then tested a system, which is valuable information when you need to expand or grow it. For example, if you have started some manual collection, and determined that there are valuable insights that can be delivered from that data, it is easier to build a case to add sensors or other measurement instruments and collect the data automatically. In cases like this, I’ve collaborated with my IT department to develop systems that will automate the data acquisition. By starting small and already having data collected, I then have a good idea of the amount of effort that will be needed to automate the system, and a sense of the benefit that will be delivered. You can then calculate a return on investment for the effort which, in addition to information from your project charter and SIPOC, is useful for convincing the IT department to work on your project.
6.3 DATA EXPLORATION
Once you have collected the data you need for your project, the next step is to explore the data and begin to understand it and what insights you can gain from it. I start by graphing and visualizing the data. One of the first graphs I make is a distribution of each of the columns of data in the data set. This helps me scope the amount of data cleaning I will need to do. It also helps get a sense of the structure of the data, if there are missing values, and if I might need to transform some data. Then, I begin to look at relationships between the data using x-y graphs.
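Before drawing any graphs, a quick per-column summary gives a similar first impression of structure, missing values, and cleaning effort. The sketch below is a standard-library stand-in for the distribution plots described above, not a replacement for them; the column names and values are hypothetical.

```python
from collections import Counter
from statistics import median


def _as_float(value):
    """Return value as a float, or None if it cannot be converted."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def summarize_column(values):
    """Count missing entries and describe the non-missing ones."""
    present = [v for v in values if v not in (None, "")]
    summary = {"rows": len(values), "missing": len(values) - len(present)}
    numbers = [_as_float(v) for v in present]
    if present and all(n is not None for n in numbers):
        # Numeric column: report the range and the middle of the distribution.
        summary.update(kind="numeric", min=min(numbers),
                       median=median(numbers), max=max(numbers))
    else:
        # Text column: report the most frequent values.
        summary.update(kind="text", top=Counter(present).most_common(3))
    return summary


# Hypothetical rows, as they might come from a CSV export.
rows = [
    {"tool": "T1", "status": "running", "temp_c": "71.5"},
    {"tool": "T2", "status": "error", "temp_c": ""},
    {"tool": "T3", "status": "running", "temp_c": "69.0"},
    {"tool": "T4", "status": "Running", "temp_c": "250.0"},
]
for column in rows[0]:
    print(column, summarize_column([row[column] for row in rows]))
```

Even on this tiny example the summary surfaces three cleaning questions: a missing temperature, a suspiciously high maximum, and two capitalizations of the same status.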
6.4 WHAT DOES THE CUSTOMER WANT TO KNOW?
As you explore the data set, keep in mind what you have learned from the SIPOC. Data exploration can be a time waster if you are looking at things in which the customer isn’t interested. At the same time, insights from data exploration can be extremely beneficial when they are unexpected and surprising. Having had the up-front conversations and the guidance from completing a project charter and SIPOC analysis aids in striking the correct balance. Focus your time on the primary questions the customer has of the data. Then, as you explore, think about secondary questions a customer might have. Take, for example, the case of building a report to look at a fleet of processing equipment. I might first want to know which tools are running, then I might want to know which tools had errors. From there I could ask about error frequency, or what are the most common errors. While doing the exploration, I can capture these questions and build them into the design of the report.
6.5 PREPARING FOR A REPORT OR MODEL
Once you have a sense of the data, you’ll begin the process of data cleaning. When working with data that is manually entered, you will have multiple spellings of the same word that need to be condensed. You will need to define or apply standards for capitalization. You will need to decide how to deal with missing data. Data cleaning frequently takes a long time and, just as frequently, isn’t mentioned. It’s important and worth the effort. When cleaning data, think about how you will test that you are not over-cleaning or under-cleaning. In a recent project, I used a simple comparison of the automatically cleaned text to human-cleaned text from the same input. This test became part of the continuous integration for the project, so any time the text cleaning code was changed, the tests to check for over-cleaning or under-cleaning were run automatically.
When exploring the data and thinking about what the customer wants to know, you may find that the data set as is doesn’t have the exact right parameters. This is where feature engineering comes into play. You will need to generate new features from the data available in the data set to help answer your customers’ questions. As an example, you may need to separate text fields into columns, or create columns through manipulating data in other columns, like calculating the duration of a process step from the timestamps for material moving in and out of that step. Other times, your data set has too much information. This is when feature selection comes into play. For smaller sets, you can accomplish this manually by looking at correlations between parameters using x-y plots. If your data set is large, say over 1,000 columns, then use a machine learning-based method.
The bulk of the time I spend on data science projects is in this phase of the project. Getting the data together can take a long time and can be a project in itself if you need to develop a data collection system. Cleaning the data can also be a major undertaking, not to mention feature engineering and feature selection. Going slowly in this step will make the rest of your project easier. This is yet another place where it pays to go slow to go fast.
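The cleaning test and the duration feature described in this section can be sketched in a few lines. This is an illustration under stated assumptions, not the code from the project mentioned above: the spelling map, sample values, and function names are hypothetical, and timestamps are assumed to be ISO-format strings.

```python
from datetime import datetime

# Hypothetical mapping from observed spellings to one standard label.
SPELLING_MAP = {
    "temp": "temperature",
    "temprature": "temperature",
    "temperture": "temperature",
}


def clean_label(raw: str) -> str:
    """Lower-case, collapse whitespace, and map known spellings to a standard."""
    text = " ".join(raw.lower().split())
    return SPELLING_MAP.get(text, text)


def cleaning_mismatches(raw_values, human_cleaned):
    """List (raw, automatic, human) triples where automatic cleaning disagrees
    with the human-cleaned version of the same input."""
    return [
        (raw, clean_label(raw), expected)
        for raw, expected in zip(raw_values, human_cleaned)
        if clean_label(raw) != expected
    ]


def step_duration_hours(move_in: str, move_out: str) -> float:
    """New feature: duration of a process step from its in/out timestamps."""
    delta = datetime.fromisoformat(move_out) - datetime.fromisoformat(move_in)
    return delta.total_seconds() / 3600.0


raw = ["Temprature ", "  TEMP", "Pressure", "temperture"]
human = ["temperature", "temperature", "pressure", "temperature"]
print(cleaning_mismatches(raw, human))  # [] means automatic and human cleaning agree
print(step_duration_hours("2021-01-15T08:00:00", "2021-01-15T10:30:00"))  # 2.5
```

Running `cleaning_mismatches` as an automated test after every change to the cleaning code is one way to catch over-cleaning or under-cleaning early.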
6.6 TIPS FOR MANAGERS
When reviewing the SIPOC with your team during the define phase of a project, ask about data acquisition. Does the required data exist? What are their plans to acquire the data they need? Do they need support from IT or other groups within your organization? Do they need to purchase data sets externally? Does your organization have policy or guidelines in place regarding external data? Make sure your team is in
compliance with policies on privacy of personally identifiable information and other policies regarding data collection and storage. Have patience during this phase of a project. Ensure your team is taking the necessary time to effectively collect and clean the data for the project. Recognize the value that is being delivered in this phase of the project. New data collection methods and solutions are a clear business benefit delivered in this phase. Cleaned and labeled data sets are additional beneficial outputs from this phase. Coach your team to think about reuse and standardization as they work through data acquisition and exploration. Can they modularize data cleaning code so it can be used in another project? Can they upload the cleaned and labeled data to a database so it can be accessed by other teams in your organization? Ensure your team is engaging with the project customer during the exploration of the data so that they are answering the correct questions and solving the correct problem.
CHAPTER 7
Model-Building Phase
Two of the project pitfalls relate directly to the model building phase of a data science project: couldn’t explain the model, and the model was too complex. The tools to address these pitfalls are to keep things simple and leverage explainability.
7.1 KEEP IT SIMPLE
The simpler the model, the easier it is to explain to others. Understanding early in the project how important it is to your customer to be able to understand how a model is using the input data to make predictions will ensure you use a type of model that meets that need. By completing the project charter and SIPOC you will have identified the customers and stakeholders for your project and understood their requirements. This can then be translated into the model selection process. The other thing the SIPOC can help with is to think about how you will maintain the model during the model selection process. By defining the input requirements in the SIPOC, you can use that information to assess how data hungry the model will be and that can help you decide between different choices. Part of keeping things simple is to use the most basic type of model that meets the project’s needs. Can you use classification and regression trees rather than a neural net? Can you use a simple tree model rather than an ensemble model? Can you use a physical model rather than machine learning? If there is already a known equation that connects the inputs to the output you want to predict, use it! This is another benefit from having done the work to document the SIPOC: you have thought through the inputs that your model requires and thought through the requirements for the model output. Having done that work, the task of model selection becomes easier. Simple models are much easier to explain. For a physical model, you can explain it by showing the equation. For a simple tree model, you can show the tree. Explaining why a model made a particular prediction becomes harder when you use ensemble methods or neural nets. This is an area of wide interest in the industry with a number of researchers working on explainability. Simple models are easier to connect to reality. There is a common trap in building models that you add parameters to improve the model but lose the connection
between the model and what is actually happening in the real world. To be useful, a model needs to be tied to reality and based on real measurements.
7.2 REPEATABILITY
Models need to be repeatable. If I provide the same inputs, I should get the same results. This is easy to ensure if my model is simple. If different people run my model, they should get the same results given the same inputs. If the model is run on a different machine, I should get the same results. To be useful, models should be transportable—meaning I can share a model with another team. This is much easier to ensure if they are simple. Good coding practices help with repeatability and the ability to share models and code between teams. This is something to think about during the modeling phase of your project. Can you modularize your code so that other teams can use pieces of your project? How will you test your model? How will you verify that the predictions from your model are accurate?
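One common source of non-repeatable results is hidden randomness. The toy sketch below, with hypothetical function names and a stand-in "model", contrasts a prediction that draws from the shared, unseeded random generator with one that uses a local, explicitly seeded generator, and shows a simple repeatability check that could be run as an automated test.

```python
import random


def noisy_predict(x: float) -> float:
    """Not repeatable: draws noise from the shared, unseeded generator."""
    return 2.0 * x + random.gauss(0.0, 0.1)


def seeded_predict(x: float, seed: int = 0) -> float:
    """Repeatable: the generator is local and seeded explicitly."""
    rng = random.Random(seed)
    return 2.0 * x + rng.gauss(0.0, 0.1)


def is_repeatable(predict, x: float, runs: int = 5) -> bool:
    """True if repeated calls with the same input give identical results."""
    return len({predict(x) for _ in range(runs)}) == 1


print(is_repeatable(seeded_predict, 3.0))  # True
print(is_repeatable(noisy_predict, 3.0))   # almost certainly False
```

A check like this only covers repeated runs in one environment; getting the same results on a different machine also depends on documenting and pinning the code's dependencies.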
7.3 LEVERAGE EXPLAINABILITY
Something to keep in mind when it comes to explainability is how data science savvy your customers are. Recently, I was discussing a project with another data scientist and their customer wanted to know why the model was predicting the importance of certain parameters. In this case, the model was using classification and regression trees to do feature selection. The difficulty is that the prediction of importance can only tell you so much: it says that these parameters correlate to the output, but that doesn’t itself tell you if those parameters are causal. To be able to explain causality, you need to consult with domain experts, or even perform additional experiments. The distinction between causality and correlation can be difficult for people, even other engineers, to understand. I use caution with neural nets because of concerns around explainability. Neural nets can be incredibly useful and are a valuable tool. They can also be a black box. You can fool them and yourself if you are not careful in the selection of training data. An example of fooling a neural net is given in the paper “‘Why Should I Trust You?’: Explaining the Predictions of Any Classifier” (Ribeiro, Singh, and Guestrin, 2016), where the researchers trained a model to distinguish between husky dogs and wolves, but the training data set was set up so that all the wolves had snow in the picture and all the husky dogs were photographed without snow, typically indoors. Essentially, they trained a neural net to identify snow, not to differentiate between
husky and wolf. In this case, the training set was deliberately selected to be biased in this way for the purposes of the paper. The risk in using a neural net comes when this type of problem occurs unintentionally, and you are not aware of the problem in the training set. Between concerns around explainability and the desire to keep things simple, I typically prioritize using models in this order from simplest to most complex:
1. physical models;
2. classification and regression trees;
3. ensemble methods; and
4. neural nets.
Of course, model selection is highly dependent on the type of data that will be used. Neural nets are particularly useful for visual analytics and natural language processing.
7.4 TIPS FOR MANAGERS
Make sure your team is keeping it simple. Use a physical model if one exists. Check in that they are not using the latest algorithm just because it is cool. Ask to see the data on what models they have tried and ask why they have selected the model they have chosen. Sometimes using an “old” method is the best, and sometimes there is a benefit to be gained from using the newest methods. Make sure your team is selecting models and algorithms to use based on business need. Ask your team about the model’s predictions. Do the predictions match what is really happening? Make certain that the model is tied to reality. Ask about the model parameters—are they available when someone will want to use the model? What are the assumptions in the model? Are they accurate? Help your team think these assumptions through and request they have data to validate their assumptions. Make sure the assumptions are clearly documented and updated as the model changes. Request that your team creates documentation on the model, what other models were tested, and why this one was selected. Beware of bias in training data, especially for neural nets but also for machine learning. Ask your team about how they are protecting against bias in their training data. Absolutely require two separate sets of data: a training set and a testing set. These are randomly created from the full dataset acquired. Don’t allow your team to fool themselves by testing the model against the data used to train it. Watch for overfitting. Overfitting is when parameters are included in a model so that the model accuracy on
the training data set becomes very high. The problem is then that the model becomes overly specific, and the accuracy on other data will be lower than optimal. This is a place where keeping it simple helps. Testing on data not used to train the model will help your team detect overfitting. Support your team in applying good coding practices. Provide a source code control system and ensure your team uses it. Have them create standards and document how they write code so there is consistency across the team. Ensure that all dependencies for your team’s code are documented and included so other teams can reuse code your team has created, and that your team can repurpose code from others. Support your team in creating tests for code and models. Consider requiring continuous integration which uses automation tools which build and test code after each change (Manturewicz, 2019). Ask about model maintenance and how the model will be supported for the long term. What is the plan to maintain the model? What will trigger the need to retrain the model? Who will own the model long term? What are the systems that are in place to support the model? What does your team need to build and what business processes need to be developed?
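The train/test discipline described above can be sketched in a few lines. This is an illustrative example using only the Python standard library; the function names, the 30% test fraction, and the toy "memorizing" model are assumptions made for the sketch, not part of any particular toolkit.

```python
import random


def split_train_test(rows, test_fraction=0.3, seed=0):
    """Randomly split rows into separate training and testing sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predict, rows):
    """Fraction of (features, label) rows that `predict` labels correctly."""
    correct = sum(1 for features, label in rows if predict(features) == label)
    return correct / len(rows)


def overfitting_gap(predict, train_rows, test_rows):
    """Training accuracy minus testing accuracy; a large gap suggests overfitting."""
    return accuracy(predict, train_rows) - accuracy(predict, test_rows)


# Toy data: the feature is an integer, the label is an arbitrary rule.
rows = [(i, (i * 7) % 3 == 0) for i in range(30)]
train_rows, test_rows = split_train_test(rows)

# A deliberately over-specific "model": it memorizes the training rows
# and answers False for anything it has not seen.
memory = dict(train_rows)


def memorizer(features):
    return memory.get(features, False)


gap = overfitting_gap(memorizer, train_rows, test_rows)
```

The memorizing model scores perfectly on the data it was trained on, which is exactly why that number says nothing useful; only the score on the held-out rows reflects how it behaves on new data.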
CHAPTER 8
Interpret and Communicate Phase
Every type of data science project ends at a different point. It occurs to me that one reason 87% of AI/data science projects don’t get to deployment may be how deployment is defined. If the nature of a project is not to deploy a model, does that mean the project didn’t get to deployment? Depending on the project type, the end deliverable might be a presentation in a meeting. For the purposes of this book, I’ll consider deployment to be successful delivery of the final product of the project, whether that is a model or communication of the results of an analysis. In Chapter 5, I listed the different types of data science projects I’ve worked on. Each of these types has a different final deliverable, as summarized in Table 8.1.
Table 8.1: Data science project types and typical final deliverables
Type of Project | Typical Final Deliverables
Data analysis | Presentation; Automated report/dashboard
Automation | Automated report/dashboard; Deployed model
Improved business process | Automated report/dashboard; Deployed model
Data mining | Presentation
Improved data science | Presentation
For data analysis projects, the deliverables are usually determining root cause for a problem, supporting the problem-solving process, or identifying problems that need to be fixed. This means that data analysis projects most often end by you presenting your findings in a meeting. Sometimes you then create an automated report or dashboard to enable your customer to continue to monitor various metrics resulting from your analysis. In projects where the goal is to automate a process, the typical deliverable is a report or dashboard. Sometimes I am asked to automate an analysis, often one which requires input from multiple sources. Sometimes I am automating the process of gathering data by generating a single report which includes all the information needed to
make a given decision. The example I used in describing how the SIPOC works in Chapter 4 is this type of project. Automation projects can also result in deployed models. An example of this is the process control project I mentioned in Chapter 7. The process we were automating was a manual adjustment of equipment parameters based on statistical process control values. We developed and deployed a physical model that would automatically make the adjustments. Improving a business process typically includes developing standards for how work is done. The data science deliverable for this type of project is a report or dashboard that either is the standard, for example a report can provide one standard way to extract and view certain data in order to make decisions, or measures the business process and helps maintain the new systems. There can also be deployed models to support this type of change, depending on the level of automation in the business process. Data mining projects result in new insights. If those insights are not communicated, no business value is generated. This communication is usually done in meetings through presentations. It can also be accomplished via emailed reports or through writing a paper. Lastly, projects which improve data science should result in a presentation or paper to share that increased capability or new algorithm. You may wind up with improved data science capability as a side benefit to a project. No matter if it is the main intent or an additional outcome, sharing what you have learned with your organization increases the ability of the organization as a whole. It is worth spending the time to write up the learning as a paper or presentation.
8.1 KNOW YOUR AUDIENCE
No matter if you are creating a presentation or building a model, it is important to understand who your audience is and to target your delivery to that audience. In the case of a presentation, the audience is the attendees of the meeting or event where you will be presenting. In the case of a report, your audience is the consumer of the report. For a model, the audience is the user of the model. In each case, you need to understand their needs and what information they hope to get. To prepare, you can ask yourself the following questions. Check in with representatives of your audience and ask them these questions as well. To start, you need to define who the audience is for this particular project. Is there a forum you will be presenting to? If that is the case, who is typically in the room? Are you providing information to support a decision? What is that decision? Who are you trying to influence? What are the primary questions your audience will want to answer with the information you are providing?
When I talk of reports, I mean automated reports or dashboards. When I talk of presentations, I mean you sharing information typically in the form of a slide deck to a group. When I talk of models, I mean AI models that make predictions. I will tackle each one separately.
8.2 REPORTS
For reports there are three rules.
1. Keep it simple.
2. Keep it clear.
3. Use good visuals.
The reason to spend time asking questions and understanding your audience is that your role is not just to communicate findings or data, but to guide your audience, interpret the results of your findings, and highlight points of interest.
The first rule is to keep it simple, and there it is again, one of the ways to avoid pitfalls for your project. For reports, put the most important information for your audience at the top left of the screen. That is where the eye goes first, since we read top to bottom, left to right. [6]
The second rule is to keep it clear. Be consistent with colors in your report. Users will come to associate particular colors with meaning, such as associating blue with Tool 102. If suddenly there is data for Tool 101, and it is shown in blue while Tool 102 is now green on graphs, your users may not notice the change and will misread the graphs. Don’t confuse the person using the report with extra information. If you are not sure they will want to see it, provide a way to view the information on demand. For this reason, I prefer interactive reports. If a report is interactive, the user can influence what information is provided to them. Clearly, you need to understand what information they might ask for in order to manage this. This is why it is important to have your end users involved in the project from the start, to define the scope, and throughout the design and development process.
[6] If your report is for an Israeli audience, flip it and put the most important information on the top right. This may also apply for Chinese or Japanese audiences. Ask about where your user expects the most important information to be on a page.
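The color-consistency rule can be enforced in code by assigning colors by name rather than by the order in which series happen to appear in the data. The sketch below is one possible approach; the `ColorRegistry` class and the palette are hypothetical, and it assumes the registry is kept (or rebuilt in the same order) between report refreshes.

```python
class ColorRegistry:
    """Assign each series name one color and keep it fixed afterwards."""

    def __init__(self, palette):
        self._palette = list(palette)
        self._assigned = {}

    def color_for(self, name):
        """Return the color already assigned to `name`, or assign the next one.

        Colors are reused from the start of the palette if there are more
        names than colors.
        """
        if name not in self._assigned:
            index = len(self._assigned) % len(self._palette)
            self._assigned[name] = self._palette[index]
        return self._assigned[name]


registry = ColorRegistry(["blue", "green", "orange"])
tool_102_color = registry.color_for("Tool 102")  # first series seen
tool_101_color = registry.color_for("Tool 101")  # a tool that shows up later
tool_102_again = registry.color_for("Tool 102")  # unchanged by the new arrival
```

With this arrangement, a new tool appearing in the data takes an unused color instead of displacing the color users already associate with an existing tool.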
When building reports, I work in an iterative mode, and check in frequently with the audience for the report. How do they expect to see the information presented? Can you enhance understanding by showing the information in a different way? What common questions do they want to ask of the data after seeing the initial visualization? How can you support answering those questions in your report? I typically start with a mock-up either on a whiteboard or as a pencil sketch on paper. This is easy to modify and iterate before effort is placed into coding or working with a data visualization software package. The third rule of reports is to use good visuals. Since that’s not the primary focus of this book, I will recommend some sources. I’ll highlight the three books I would start with and have included a longer reading list in Table 8.2. The first book is Edward Tufte’s, The Visual Display of Quantitative Information, which outlines the principles of data visualization (Tufte, 2001). Stephen Few’s book, Show Me the Numbers: Designing Tables and Graphs to Enlighten, was the text book for a course I took on data visualization and has examples of both good and bad visualizations (Few, 2012). Finally, Storytelling with Data: A Data Visualization Guide for Business Professionals by Cole Nussbaumer Knafic gives practical tips and step-by-step examples of taking graphs from complicated and cluttered to simple and clear (Knafic, 2015). Table 8.2: Data visualization reading list
Title | Author (year) | ISBN or citation
The Visual Display of Quantitative Information (2nd Edition) | Edward Tufte (2001) | 978-0961392147
Show Me the Numbers: Designing Tables and Graphs to Enlighten (2nd Edition) | Stephen Few (2012) | 978-0970601971
Storytelling with Data: A Data Visualization Guide for Business Professionals | Cole Nussbaumer Knafic (2015) | 978-1119002253
Designing with the Mind in Mind: Simple Guide to Understanding User Interface Design Guidelines | Jeff Johnson (2014) | 978-0124079144
“Choosing Colors for Data Visualization” | Maureen Stone (2006) | Paper (Stone, 2006)
Envisioning Information | Edward Tufte (1990) | 978-0961392116
Beautiful Evidence | Edward Tufte (2006) | 978-0961392178
Information Dashboard Design (2nd Edition) | Stephen Few (2013) | 978-938377006
“Visualizations That Really Work” | Scott Berinato (2016) | Paper (Berinato, 2016)
“Narrative Visualization: Telling Stories with Data” | Edward Segel and Jeffrey Heer (2010) | Paper (Segel and Heer, 2010)
8.3 PRESENTATIONS
For presentations, there are four rules. The first three are in common with reports: (1) keep it simple; (2) keep it clear; and (3) use good visuals. There is one additional rule for presentations: (4) make your presentation tell a story. The goal of your presentation is to guide your audience, interpret the results of your findings to meet their requirements, and highlight points of interest. Spending time to understand who you will be presenting to sets you up to accomplish those tasks.
The first rule is to keep it simple. For presentations, put your interpretation of your findings at the very start of your slide deck. Knowing your audience helps here. Do you need to provide background information for them to understand the context? Is the forum you will be presenting to one of those that doesn’t let speakers get past the first slide without a ton of questions? If it is that type of forum, build your presentation to accommodate their style, by having one slide with the key point you want to communicate, and then links to backup information to answer anticipated questions. No matter what type of audience they are, keep to one idea per slide.
Keep your presentation clear. Don’t confuse your audience with all the analysis you did to get to your final conclusion. Save that information in the backup of your presentation in case of questions. Retain only the key graphs that got you to the conclusion, not every graph you generated in exploring the data. Part of keeping things clear is to not clutter your slides with extra information. If you have a dense slide, consider separating the information into multiple slides, or use builds to walk the audience through the information. Match what you say to what is on the screen at that moment. A very good piece of advice is to plan out what you will say for each slide and write it up either in the speaker notes section or in a separate Word document.
Depending on how formal the presentation will be, consider practicing your presentation. If your presentation will be timed, practicing your delivery is
particularly important. Practicing will allow you to gauge if you have too much or too little content and will help get you used to working within the time limits. Think about what the key point is that you want to communicate. Since each slide only has one idea, make that idea the title of the slide. For maximum impact, I express my titles as headlines. For example, “Increased marketing budget correlates to increased sales” is the title of a slide with a graph showing marketing budget versus sales (Figure 8.1). Notice also how I have included a text box with my analysis and suggested course of action. Each slide should have a key takeaway message. To make that message clear to the audience, I include it in a text box at the bottom of the slide and use a build to allow them time to read the graph, before sharing my conclusion.
[Slide content for Figure 8.1. Title: “Increased Marketing Budget Correlates to Increased Sales.” Graph of sales revenue (dollars, $0 to $70,000) against marketing budget (dollars, $0 to $35,000). Takeaway text box: “The more spent on marketing so far, the higher the sales revenue has been. It may be starting to top out at $30K. We should test that over the next two months by spending $32K each month.” Footnote: “Mocked-up data from marketing and sales database, extracted August 28, 2010 by Joyce Weiner.”]
Figure 8.1: Example presentation slide
The third rule of presentations is to use good visuals. Use graphics and visualizations to underscore your point. As you present, verbally walk the audience through your visualization, if they may not be familiar with reading that type of graph or table. Include the data source and when the data was extracted as footnotes. Good visuals often “grow legs” meaning a graph that really explains something well is often copied and used in other presentations. This means you did an excellent job in capturing an insight in a visual. Make sure your name is on the visuals you create, so that when they are shared, you get credit. For presentations, make your presentation tell a story. Like a story, your presentation should have a plot and a beginning, middle, and end. Telling a story makes your presentation easier to follow and makes it memorable. In the beginning of the story,
let your audience know what to expect in your presentation. In the middle, deliver the content and the value of the presentation, and in the end, summarize what you covered. This is the classic, “Tell them what you’re going to tell them, tell them, and tell them what you told them.” There is a reason it’s a classic—because it is effective. Stories need plots. Some possible plots for presentations are: “My problem and how I solved it,” “The current problem, options for solving it and the one I like,” and “The current problem and the help I need from you.” In “my problem and how I solved it” you are sharing data analysis used to identify solutions to a problem and verify that the problem has been solved. Or, you might be presenting on improvements you made to model accuracy or a new algorithm. Another variation on this plot is “my problem and how I found it” where you report on data mining analysis used to uncover a problem. In “the current problem, options for solving it and the one I like” you are providing analysis to support decision making. You are also providing analysis of possible solutions and guiding the audience with your assessment of which is the best option and why. In “The current problem and the help I need from you” you are presenting on analysis of a problem and what resources are needed to move forward with a solution. Of course, for all these stories, you are providing supporting evidence in the form of charts, graphs, and tables.
8.4 MODELS
For models, interpretation of the information is built into the system that has been put in place around the model and that uses the model’s output. For example, we have a system that predicts the need for preventative maintenance. Based on the results of the model’s prediction, the system will flag a user that maintenance is needed, or even schedule maintenance through existing systems. If the model predicts a tool needs adjustment, that might trigger a report with adjustment suggestions to engineering or might trigger an automated adjustment. The method that is used depends on trust level with that model, and how much experience the users have had with the model.
So, depending on the user’s level of comfort with the model, projects involving a model might need reports, or they might need systems which interface with the model. These systems need to provide the inputs required by the model to generate a prediction and have rules or other methods to interpret that prediction.
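The rules that interpret a prediction can be very small. The following sketch shows one way the trust-dependent choice between a report and an automated adjustment might look; the `TrustLevel` values, the action names, and the `route_prediction` function are hypothetical and stand in for whatever the surrounding system actually provides.

```python
from enum import Enum


class TrustLevel(Enum):
    LOW = 1   # users want to review every suggestion themselves
    HIGH = 2  # users are comfortable with automated action


def route_prediction(needs_adjustment, suggested_change, trust_level):
    """Turn a model prediction into an action for the surrounding system.

    Returns an (action, payload) pair; the caller is responsible for
    actually sending the report or applying the adjustment.
    """
    if not needs_adjustment:
        return ("no_action", None)
    if trust_level is TrustLevel.HIGH:
        return ("automated_adjustment", suggested_change)
    return ("report_to_engineering", suggested_change)


# Example: the model suggests lowering a setting by 0.5 units.
early_action = route_prediction(True, -0.5, TrustLevel.LOW)
later_action = route_prediction(True, -0.5, TrustLevel.HIGH)
```

Keeping the routing rule separate from the model makes it easy to start with reports and move to automated adjustment once users have gained experience with the predictions.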
8.5 TIPS FOR MANAGERS
Coach your team to understand their audience before they build presentations, reports, or models. Check that they are keeping things simple and clear, and that they are interpreting their results with the audience in mind. Become familiar with good data visualization practices and coach your team to apply them. Support them in setting up meetings and collaboration sessions with the end users of their reports and models so they can understand the users’ needs and work with their users to develop output that works best for the customer.
Support your team in collaborating in an iterative way with the customer. Do this by having regular check-in meetings with the customer during development. At these meetings, the agenda is to review the progress, give a demonstration of the report or model to the customer, and collect feedback.
For projects that end with a presentation, provide the presenter with information about the expected audience, meeting attendees, and personalities. If there are specific expectations for presenters in that meeting, share them with your team. Add context to help build an effective presentation. For example, if there will be an executive in the room, ensure the presentation is tailored to match that executive’s preferred style, whether that is having a single slide with backup information or having context before making a decision. Help your team collect this information. Good sources are the executive’s assistant, the meeting chair, and other people who have presented in that forum.
Coach your team to practice giving their presentation before the meeting, especially if they are presenting to senior executives. Provide an opportunity for a dry run with you and give feedback on content, flow, and delivery.
CHAPTER 9
Deployment Phase
Begin planning for deployment from the beginning of your project. Excitement is a common trap: when you start a new project, you may just want to get some data, explore it, and do some model building. The trouble is that you have then started down a path without fully thinking it through. Taking the time at the beginning of the project to think about deployment gives your project an edge and can help you beat the odds and be one of the 13% of projects that get to the deployment phase. Although it is tempting, don’t fall into the trap of minimally cleaning the data and rushing to build a model. This way of executing a data science project makes deployment difficult because you haven’t thought about or planned for maintaining the model in deployment. Once you have created something, it is really disappointing to realize it is all wrong and needs to be scrapped. That decision is much easier to make at the beginning, before you have spent any time building.
9.1 PLAN FOR DEPLOYMENT FROM THE START
In planning for deployment, you need to think through a few things. One, how will the model be deployed? By this I mean, how will the user access the model, and get a prediction? Two, if the project deliverable is not a model but a report, how will the user access and interact with the report? Three, who will maintain the project long term? How will you know if it stops working? What is the expected lifetime of the project? Four, how often will you need to update the extraction, transformation, and load for the input data? How often will you need to retrain or otherwise update the model?
Any project in deployment has a cycle. This cycle goes like this: plan, implement, monitor, review. This is a loop that repeats over and over throughout the lifetime of the project. First, you plan any needed improvements, then you implement them, monitor the results, and review and decide on any needed changes.
In deploying a model or report, using existing systems keeps things simple. It also helps ensure that your user will actually use your new model or report. There is nothing worse than spending time and effort in creating something that is then not used at all. As an aside, I recommend including some way to track usage for reports so that if you find a report you created is not used, you can circle back with your customer and have a conversation about their current needs and how your report could
be enhanced, adjusted to meet those needs, or possibly that your report is no longer needed and you can stop running it. Using systems that are already in place means less work is needed to get your report or model into production. Inserting a model into an existing system, or even building the model in a spreadsheet is easier than creating an all new application or building a website for your model. When I worked in manufacturing, all our factory equipment had associated computers to control the equipment. It was very easy to deploy models to that controlling computer and harness the existing systems.
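Usage tracking for a report does not need to be elaborate. The sketch below is a minimal illustration that keeps view events in memory; the `UsageLog` class, the report name, and the 30-day window are assumptions for the example, and a real report would more likely write these events to whatever logging system already exists.

```python
from datetime import datetime, timedelta


class UsageLog:
    """Record when each report is viewed."""

    def __init__(self):
        self._views = []  # list of (report_name, viewed_at) pairs

    def record_view(self, report_name, viewed_at):
        self._views.append((report_name, viewed_at))

    def views_since(self, report_name, since):
        """Count views of one report at or after the given time."""
        return sum(
            1 for name, at in self._views if name == report_name and at >= since
        )


def is_unused(log, report_name, now, window_days=30):
    """True if the report has had no views in the last `window_days` days."""
    return log.views_since(report_name, now - timedelta(days=window_days)) == 0


log = UsageLog()
log.record_view("yield_dashboard", datetime(2020, 1, 1, 9, 0))
log.record_view("yield_dashboard", datetime(2020, 1, 5, 14, 30))

still_used_in_january = not is_unused(log, "yield_dashboard", datetime(2020, 1, 10))
unused_by_march = is_unused(log, "yield_dashboard", datetime(2020, 3, 1))
```

A report flagged as unused is the cue for the conversation with the customer, not for deleting the report automatically.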
9.2 DOCUMENTATION
A favorite quote of mine is, “documentation is a gift you give your future self.” Having started with a project charter and a SIPOC, you have begun to document the project at the very start. It is best to continue with this practice of documenting as you go, rather than waiting until the very end to write the project documentation. When you wait, the burden of remembering what you did and why can become a mountain of a task and make documentation hard to complete.
Frequently, projects fail because of a lack of good documentation. A project may be implemented once but can’t be easily maintained because there was no transfer of knowledge. A good model can’t be reused because it doesn’t run on someone else’s computer due to undocumented dependencies. What often happens in these cases is that teams will redo a project, reinventing what already existed because they are unable to use it.
When thinking about deployment, think about the documentation for the project. Where will you store the code and the documentation? Make sure all information about the project is in one place. This can be a wiki, a shared drive, or a code repository. While you are documenting your project, take the time to write up all the decisions you made and why you made them. Did you investigate multiple models before selecting the one that worked the best? That is great information to have for the future, and for sharing with other teams in your organization. Did you select a particular language to use for scripting because it was easy to interface into existing systems? Again, great information to capture and keep for the long term.
At the end of the project, when it has been deployed, take time to reflect and document the learnings you had over the course of the project. This should include things like new insights that were gained, new algorithms that were developed, and general learning that occurred. Also, take the time to go back and revisit the project
charter. Did you accomplish what you planned to do? Why or why not? Write up your reflections and include them with the final business value delivered by your project.
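Dependencies are one piece of documentation that can be captured automatically. As an illustration only, and assuming the project happens to be written in Python, the sketch below lists the packages installed in the current environment using the standard library’s `importlib.metadata`; the function names and the output file name are hypothetical.

```python
from importlib import metadata


def installed_packages():
    """Return sorted 'name==version' strings for packages in this environment."""
    pins = set()
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        version = dist.metadata.get("Version")
        if name and version:  # skip entries with incomplete metadata
            pins.add(f"{name}=={version}")
    return sorted(pins, key=str.lower)


def write_dependency_file(path):
    """Write one 'name==version' line per package; return the number written."""
    lines = installed_packages()
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")
    return len(lines)
```

Storing a file like this next to the code and the rest of the project documentation gives the next owner a starting point for reproducing the environment in which the model was built.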
9.3 MAINTENANCE
A big consideration in the deployment phase is who will maintain the report or model. The SIPOC is helpful in defining this as it gives information about who will be using the model and what they expect. If your user is expecting 24×7 support, you need to plan for that before putting your solution into production. I ran into the problem of not planning for maintenance early in my career. I developed a report to make it easier for manufacturing to make some production decisions, and my report stopped working at 2 am. Of course, I was called in the middle of the night to fix the report and get production running again. While that was all right as a one-time solution, it would not work in the long term. I needed to convert my report to work with existing systems that were supported on a round-the-clock basis. If I had thought this through at the beginning, it would have prevented a scramble and a re-write of the report.
Before your project goes into deployment, think about how you will know if the report or model has stopped working. For a model, this is about error checking and verification. Will you have a defined testing cycle? Can you detect errors automatically? For a report this can be as simple as having a programmatically generated time stamp at the top. The user can then check to see that the report has updated before using the information. There is nothing worse than learning that decisions have been made based on a report or dashboard that hasn’t been updated in two weeks. I include a timestamp and a support phone number or email in my reports, so if a user identifies that the report has stopped, they can contact the report owner or support team and notify them of the problem.
Think about when you will update the model, and what will trigger an update. The same goes for reports. Will you establish a time-based update cycle? Is there a specific event that would trigger an update?
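The timestamp-and-contact idea takes only a few lines to implement. The sketch below is one possible version; the header layout, the 24-hour freshness limit, and the support address are assumptions chosen for the example.

```python
from datetime import datetime, timedelta


def report_header(generated_at, support_contact):
    """Build the line shown at the top of the report."""
    return f"Last updated: {generated_at:%Y-%m-%d %H:%M} | Support: {support_contact}"


def is_stale(generated_at, now, max_age=timedelta(hours=24)):
    """True if the report is older than the allowed age."""
    return now - generated_at > max_age


generated = datetime(2020, 1, 1, 2, 0)
header = report_header(generated, "support@example.com")
stale_after_two_weeks = is_stale(generated, datetime(2020, 1, 15, 2, 0))
```

The same `is_stale` check can be run by a scheduled job, so the report owner hears about a stopped report before a user does.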
For a model, you can measure accuracy, and if it drops below a certain threshold, trigger retraining of the model. Sometimes there are external factors that influence when you should retrain, such as changes to the process that the model is built for, or changes to automation. Finally, think about the expected lifetime for your project. What will trigger obsolescence? It is unlikely that you will be running the same report or model forever. How will you know if it is no longer being used? For reports, I like to have some way of telling whether users are interacting with or viewing them. When that usage count falls off, it’s
time to have a conversation with the main decision maker about the usefulness of the report. At this point you have two choices, you can update the report to meet the new needs, or you can cancel the report because it is no longer useful. When you no longer need a report or model, having documentation about all the pieces and dependencies is incredibly helpful in completing end of life tasks. You don’t want to remove or delete something that other models, reports, or teams rely on. Cleaning up after a report or model frees up shared resources (compute, storage, etc.).
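The accuracy-threshold trigger for retraining described earlier in this section can be sketched as a small monitor. This is an illustrative example, not a prescribed design: the `AccuracyMonitor` class, the window of recent predictions, and the threshold values are all assumptions, and it presumes the true outcome eventually becomes available for each prediction.

```python
from collections import deque


class AccuracyMonitor:
    """Track accuracy over the most recent predictions."""

    def __init__(self, window=100, threshold=0.9):
        self._outcomes = deque(maxlen=window)  # True where prediction was right
        self.threshold = threshold

    def record(self, predicted, actual):
        self._outcomes.append(predicted == actual)

    def accuracy(self):
        """Accuracy over the window, or None if nothing has been recorded."""
        if not self._outcomes:
            return None
        return sum(self._outcomes) / len(self._outcomes)

    def needs_retraining(self, process_changed=False):
        """Retrain if accuracy fell below the threshold or the process changed."""
        current = self.accuracy()
        below_threshold = current is not None and current < self.threshold
        return process_changed or below_threshold


monitor = AccuracyMonitor(window=4, threshold=0.75)
for _ in range(4):
    monitor.record("ok", "ok")
healthy = not monitor.needs_retraining()

monitor.record("ok", "fail")
monitor.record("ok", "fail")
degraded = monitor.needs_retraining()
```

The `process_changed` flag covers the external triggers, such as a change to the process the model was built for, that accuracy alone may be slow to reveal.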
9.4 TIPS FOR MANAGERS
Work with your team to plan for deployment from the very start of a project. Different projects have different end points and final deliverables (Table 8.1). Support your team in recognizing which type of project they are working on and delivering to those endpoints. Help them identify when a project is done, hold the retrospective, and celebrate the completion and delivery of projects with them. Remember, when your team deploys a data science project, they have beaten the odds and have avoided the project pitfalls. Make sure they recognize the benefit of having structure for project development and of going slow to go fast.
Check that your team is documenting the project. This includes the project charter and SIPOC. Documentation should continue throughout the project, and not be done only at the very end. Check in with your team and ask about the current documentation. When decisions are made about the direction of the project, like which model to use, or which system to use to deploy the model, make sure this is captured in the project documentation. Capture not just the decision, but what alternatives were explored and why that one was selected. This is helpful in case a decision needs to be revisited in the future. Support your team by having systems in place for collecting documentation and storing it. Once a project is deployed, make sure the project charter is updated to capture the delivered business value, and include the writeup from the retrospective in the document repository.
Slow your team down and have them plan out the work at the start of a project, or whenever you discover that it has not been done. Have them do a project charter and SIPOC to be set up properly for deployment. Keep things simple by asking how they can use existing systems to deploy projects. Make sure they are thinking about the full lifetime of a project. Help them think through maintenance of models and reports.
Ask your team how they will monitor the report or model to know that it is working properly. Ask how they will monitor the report or model to know that it is still being used. Help them plan for obsolescence and have systems in place for decommissioning
reports and models. Either build the capability for long-term support of models and reports within your team or establish systems and methods for transferring projects to other teams for long-term support.
CHAPTER 10
Summary of the Five Methods to Avoid Common Pitfalls
The current statistic is that 87% of AI/big data projects fail. In this context, failing means that the project never reaches deployment. By applying five methods to avoid common pitfalls, you can give your project a better opportunity to beat the odds and be one of the 13% that make it to production.
The five pitfalls are:
1. the scope of the project is too big;
2. the project scope increased in size as the project progressed, i.e., scope creep;
3. the model couldn’t be explained, hence there was lack of trust in the solution;
4. the model was too complex and therefore was difficult to maintain; and
5. the project solved the wrong problem.
The five methods are:
1. ask questions;
2. get alignment;
3. keep it simple;
4. leverage explainability; and
5. have the conversation.
10.1 ASK QUESTIONS
First of all, ask questions. This is something you need to do from the start of the project and continue through the very end. Ask questions to get feedback during the define phase of the project as you are starting to design the solution. Ensure that you understand the problem to be solved and the process to be addressed, and ask questions to verify your understanding. Ask the questions to get this input before you
start gathering data or building a model. Use the project charter and SIPOC analysis tool to guide you in asking these questions. Asking questions is not a one-and-done activity. Continue to check in with the project stakeholders and customers as you go to ensure you are solving the right problem and meeting their requirements. Update the charter and SIPOC as you gain clarity and learn more about your customer’s needs. Show your customer the results of your initial data exploration and ask for feedback. Ask about explainability and how the model will be used. Ask about long-term considerations, such as who will own maintaining the model. Ask for feedback at the end of the project.
10.2 GET ALIGNMENT
Second, get alignment. Use the project charter to document the expected business value your project will deliver. You can reassess this, revise the charter as you go, and realign with your stakeholders and decision maker. Make sure you have aligned on what “done” looks like, to prevent scope creep and to enable you to actually finish the project and get it into production. Be sure you are aligned with your customer on explainability. Check in to be sure they are not confused by terminology like correlation and causality. Make sure you are using a model that fits their needs. Get alignment on boundary conditions and long-term considerations. Share your plans for deployment and maintenance with your customer and decision maker.
10.3 KEEP IT SIMPLE
Third, keep it simple. Use physical models and simple techniques. Use the minimum number of input parameters needed to achieve your project goals. Plan ahead for deployment and keep the deployment method simple. Use systems that already exist rather than creating new ones. Think long term and think about how the model will be maintained. Simple solutions are easier to maintain and easier to transfer from owner to owner.
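One way to approach “the minimum number of input parameters” is greedy forward selection: start with no inputs, add whichever input helps most, and stop as soon as the agreed goal is met. The sketch below is an illustration under stated assumptions, not a method taken from this book. The `score` callable, the input names, and the gain values are all hypothetical stand-ins for “fit a model and validate it.” Greedy selection also does not guarantee the smallest possible set; it is just a simple, explainable heuristic.

```python
def select_minimum_inputs(candidates, score, goal):
    """Greedy forward selection: add one input at a time until `goal` is met.

    candidates: list of input (feature) names
    score: hypothetical callable taking a list of inputs and returning a
           quality metric where higher is better
    goal: the metric value agreed with the customer as good enough
    """
    selected = []
    remaining = list(candidates)
    best = float("-inf")
    while remaining and best < goal:
        # Try each unused input and keep the one that helps the most.
        trial_scores = {name: score(selected + [name]) for name in remaining}
        winner = max(trial_scores, key=trial_scores.get)
        if trial_scores[winner] <= best:
            break  # no remaining input improves the metric; stop adding
        selected.append(winner)
        remaining.remove(winner)
        best = trial_scores[winner]
    return selected, best


# Toy scoring function; the gains are made-up numbers for illustration only.
GAINS = {"temperature": 0.60, "pressure": 0.25, "humidity": 0.05, "operator_id": 0.00}


def toy_score(inputs):
    return sum(GAINS[name] for name in inputs)


inputs, achieved = select_minimum_inputs(list(GAINS), toy_score, goal=0.80)
print(inputs, achieved)
```

In this toy run the search stops after two inputs because the goal is already met, so the two low-value inputs never enter the model and never need to be collected or maintained.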
10.4 LEVERAGE EXPLAINABILITY
Fourth, leverage explainability if it is important to your customer and their decision-making process. Ask about their process up front so that you establish their criteria for explainability and use a model that your customer is comfortable with.
Consider simpler models like physical models or simple tree models to support explainability. Take advantage of new techniques for explainability that are currently being researched.
10.5 HAVE THE CONVERSATION
Lastly, have the conversation. An ongoing two-way conversation between you and the users of your models and reports is valuable. Keep your customer involved as you go. This prevents scope creep and overbuilding. Continue to check in with your customer throughout all the phases of the project. Start involving your customer in the define phase by jointly creating the project charter and getting their input on the SIPOC analysis. Keep them updated as you acquire and explore the data and build models. Share the interpretation of the results and your plans for deployment. This ongoing conversation will prevent you from solving the wrong problem. Continually update your project charter as you go to help facilitate the communication. Finally, when the project is complete and in production, reflect on what you learned overall, calculate the delivered business value, and share both with your customer and decision maker.
By following these five methods, you will help your project avoid the pitfalls of too big a scope, scope creep, a model that can’t be explained, a model that is too complex, and solving the wrong problem. This will set you up to deliver a project that beats the odds and makes it into production.
References
Beck, K. and Beedle, M. (2001). Principles behind the Agile Manifesto. Retrieved from agilemanifesto.org: https://agilemanifesto.org/principles.html.
Berinato, S. (2016). Visualizations that really work. Retrieved from Harvard Business Review: https://hbr.org/2016/06/visualizations-that-really-work.
Deming, W. E. (1993). A bad system will beat a good person every time. Retrieved from The W. Edwards Deming Institute: https://deming.org/a-bad-system-will-beat-a-good-person-every-time/.
Few, S. (2012). Show Me the Numbers: Designing Tables and Graphs to Enlighten. El Dorado Hills, CA: Analytics Press.
Few, S. (2013). Information Dashboard Design. Analytics Press.
George, M. L., Rowlands, D., and Kastle, B. (2003). What is Lean Six Sigma? McGraw-Hill Education.
Johnson, G. (2014). Designing with the Mind in Mind: Simple Guide to Understanding User Interface Design Guidelines. 2nd Edition. Morgan Kaufmann.
Knaflic, C. N. (2015). Storytelling With Data: A Data Visualization Guide for Business Professionals. Hoboken, NJ: John Wiley and Sons, Inc. DOI: 10.1002/9781119055259.
Manturewicz, M. (2019). What is CI/CD - all you need to know. Retrieved from codilime.com: https://codilime.com/what-is-ci-cd-all-you-need-to-know/.
Oxford Languages. (2020). Artificial intelligence definition. Retrieved from google.com: https://tinyurl.com/y6zwlnkw.
Ribeiro, M. T., Singh, S., and Guestrin, C. (2016). “Why should I trust you?”: Explaining the predictions of any classifier. arXiv 1602.04938 [cs.LG]. Retrieved from https://arxiv.org/abs/1602.04938. DOI: 10.18653/v1/N16-3020.
Royal Society. (2019). Explainable AI: the basics Policy Briefing. Retrieved from Royal Society: https://www.exploreaiethics.com/reports/explainable-ai-the-basics/.
Segel, E. and Heer, J. (2010). Narrative visualization: Telling stories with data. IEEE Transactions on Visualization and Computer Graphics (Proc. InfoVis). DOI: 10.1109/TVCG.2010.179.
Stone, M. (2006). Choosing colors for data visualization. Retrieved from Perceptual Edge: https://www.perceptualedge.com/articles/b-eye/choosing_colors.pdf.
Tufte, E. (2001). The Visual Display of Quantitative Information. Cheshire, CT: Graphics Press.
Tufte, E. (1990). Envisioning Information. Cheshire, CT: Graphics Press.
Tufte, E. (2006). Beautiful Evidence. Cheshire, CT: Graphics Press.
VB Staff. (2019). Why do 87% of data science projects never make it into production? Retrieved from Venturebeat.com: https://venturebeat.com/2019/07/19/why-do-87-of-data-science-projects-never-make-it-into-production/.
Author Biography
Joyce Weiner is a Principal Engineer at Intel Corporation. Her area of technical expertise is data science and using data to drive efficiency. Joyce is a black belt in Lean Six Sigma. She has a B.S. in Physics from Rensselaer Polytechnic Institute, and an M.S. in Optical Sciences from the University of Arizona. She lives with her husband outside Phoenix, Arizona.