TECHNOLOGY IN ACTION™
Computer Vision with Maker Tech Detecting People With a Raspberry Pi, a Thermal Camera, and Machine Learning — Fabio Manganiello
Computer Vision with Maker Tech: Detecting People With a Raspberry Pi, a Thermal Camera, and Machine Learning Fabio Manganiello Amsterdam, The Netherlands ISBN-13 (pbk): 978-1-4842-6820-9 https://doi.org/10.1007/978-1-4842-6821-6
ISBN-13 (electronic): 978-1-4842-6821-6
Copyright © 2021 by Fabio Manganiello This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark. The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights. While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein. Managing Director, Apress Media LLC: Welmoed Spahr Acquisitions Editor: Aaron Black Development Editor: James Markham Coordinating Editor: Jessica Vakili Distributed to the book trade worldwide by Springer Science+Business Media New York, 1 New York Plaza, New York, NY 10014. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail orders-ny@springer-sbm.com, or visit www.springeronline.com. Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation.
For information on translations, please e-mail [email protected]; for reprint, paperback, or audio rights, please e-mail [email protected]. Apress titles may be purchased in bulk for academic, corporate, or promotional use. eBook versions and licenses are also available for most titles. For more information, reference our Print and eBook Bulk Sales web page at http://www.apress.com/bulk-sales. Any source code or other supplementary material referenced by the author in this book is available to readers on GitHub via the book’s product page, located at www.apress.com/978-1-4842-6820-9. For more detailed information, please visit http://www.apress.com/source-code. Printed on acid-free paper
Table of Contents About the Author ................................................................................vii About the Technical Reviewer .............................................................ix Introduction .........................................................................................xi Chapters at a Glance .........................................................................xiii Chapter 1: Introduction to Machine Learning .......................................1 1.1 History .............................................................................................................3 1.2 Supervised and unsupervised learning...........................................................9 1.3 Preparing your tools ......................................................................................11 1.3.1 Software tools ......................................................................................12 1.3.2 Setting up your environment ................................................................13 1.4 Linear regression ..........................................................................................17 1.4.1 Loading and plotting the dataset ..........................................................18 1.4.2 The idea behind regression ..................................................................19 1.4.3 Gradient descent ..................................................................................22 1.4.4 Input normalization...............................................................................28 1.4.5 Defining and training the model ...........................................................30 1.4.6 Evaluating your model ..........................................................................34 1.4.7 Saving and loading your model ............................................................38
1.5 Multivariate linear regression .......................................................................41 1.5.1 Redundant features ..............................................................................45 1.5.2 Principal component analysis...............................................................47 1.5.3 Training set and test set .......................................................................52 1.5.4 Loading and visualizing the dataset .....................................................53 1.6 Polynomial regression...................................................................................60 1.7 Normal equation ...........................................................................................64 1.8 Logistic regression........................................................................................67 1.8.1 Cost function ........................................................................................72 1.8.2 Building the regression model from scratch ........................................75 1.8.3 The TensorFlow way .............................................................................82 1.8.4 Multiclass regression ...........................................................................84 1.8.5 Non-linear boundaries ..........................................................................85
Chapter 2: Neural Networks ...............................................................87 2.1 Back-propagation .........................................................................................96 2.2 Implementation guidelines..........................................................................104 2.2.1 Underfit and overfit.............................................................................106 2.3 Error metrics ...............................................................................................111 2.4 Implementing a network to recognize clothing items.................................116 2.5 Convolutional neural networks ...................................................................134 2.5.1 Convolutional layer .............................................................................138 2.5.2 Pooling layer .......................................................................................144 2.5.3 Fully connected layer and dropout .....................................................145 2.5.4 A network for recognizing images of fruits ........................................146
Chapter 3: Computer Vision on Raspberry Pi ...................................159 3.1 Preparing the hardware and the software ..................................................161 3.1.1 Preparing the hardware ......................................................................161 3.1.2 Preparing the operating system .........................................................166 3.2 Installing the software dependencies .........................................................169 3.3 Capturing the images ..................................................................................179 3.4 Labelling the images ...................................................................................182 3.5 Training the model ......................................................................................185 3.6 Deploying the model ...................................................................................195 3.6.1 The OpenCV way .................................................................................197 3.6.2 The TensorFlow way ...........................................................................201 3.7 Building your automation flows ..................................................................205 3.8 Building a small home surveillance system................................................208 3.9 Live training and semi-supervised learning................................................216 3.10 Classifiers as a service .............................................................................219
Bibliography .....................................................................................227 Index .................................................................................................231
About the Author

Fabio Manganiello has more than 15 years of experience in software engineering, many of these spent working on machine learning and distributed systems. In his career he has worked, among others, on natural language processing, an early voice assistant (Voxifera) developed back in 2008, and machine learning applied to intrusion detection systems, using both supervised and unsupervised learning; and he has developed and released several libraries to make models' design and training easier, from back-propagation neural networks (Neuralpp) to self-organizing maps (fsom) to dataset clustering. He has contributed to the design of machine learning models for anomaly detection and image similarity in his career as a tech lead in Booking.com. In recent years, he has combined his passion for machine learning with IoT and distributed systems. From self-driving robots to people detection, to anomaly detection, to weather forecasting, he likes to combine the flexibility and affordability of tools such as Raspberry Pi, Arduino, ESP8266, and cheap sensors with the power of machine learning models. He's an active IEEE member and open source enthusiast and has contributed to around 100 open source projects over the years. He is the creator and main contributor of Platypush, an all-purpose platform for automation that aims to connect any device to any service, and also provides TensorFlow and OpenCV integrations for machine learning.
About the Technical Reviewer

Vishwesh Ravi Shrimali graduated from BITS Pilani, where he studied mechanical engineering, in 2018. Since then, he has worked with Big Vision LLC on deep learning and computer vision and was involved in creating official OpenCV AI courses. Currently, he is working at Mercedes-Benz Research and Development India Pvt. Ltd. He has a keen interest in programming and AI and has applied that interest in mechanical engineering projects. He has also written multiple blogs on OpenCV and deep learning on Learn OpenCV, a leading blog on computer vision. He has also coauthored Machine Learning for OpenCV 4 (second edition) by Packt. When he is not writing blogs or working on projects, he likes to go on long walks or play his acoustic guitar.
Introduction

The field of machine learning has gone through massive growth in the last few years, thanks to increased computing power, increased funding, and better frameworks that make it easier to build and train classifiers. Machine learning, however, is still considered a tool for IT giants with plenty of resources, data, and computing power. While it is true that models get better when they can be trained with more data points, and computing power surely plays a role in the ability to train complex models, in this book we'll see that it's already possible to build models trained on data points gathered from a smart home environment (like temperature, humidity, presence, or camera images), and you can already use those models to make your home "smarter." Such models can be used for predictions even on a Raspberry Pi or similar hardware. After reading this book, you will

• Know the formal foundations of the main machine learning techniques

• Know how to estimate how accurate a classifier is in making predictions and what to tweak in order to improve its performance

• Know how to cleanse and preprocess your data points to maximize the performance of your models

• Be able to build machine learning models using TensorFlow and the standard Python stack for data analysis (numpy, matplotlib, pandas)

• Be able to set up a Raspberry Pi with a simple network of sensors, cameras, or other data sources that can be used to generate data points fed to simple machine learning models, make predictions on those data points, and easily create, train, and deploy models through web services
Chapters at a Glance

Chapter 1 will go through the theoretical foundations of machine learning. It will cover the most popular approaches to machine learning, the difference between supervised and unsupervised learning, and take a deep dive into regression algorithms (the foundation block for most of today's supervised learning). It will also cover the most popular strategies to visualize, evaluate, and tweak the performance of a model.

Chapter 2 will take a deep dive into neural networks, how they operate and "learn," and how they are used in computer vision. It will also cover convolutional neural networks (CNNs), a popular architecture used in most of today's computer vision classification models.

Chapter 3 will provide an overview of the most common tools used by makers in IoT today, with a particular focus on the Raspberry Pi. We'll see how to use one of these devices to collect, send, and store data points that can be used to train machine learning models, how to train a simple model to detect the presence of people in a room using a cheap camera, and how to use it to make predictions within a home automation flow—for example, turn the lights on/off when presence is detected or send a notification when presence is detected but we are not home. The chapter will also provide an introduction to some strategies for semi-supervised learning and show how to wrap a web service around TensorFlow to programmatically create, train, and manage models.
CHAPTER 1
Introduction to Machine Learning

Machine learning is defined as the set of techniques that let a machine perform a task it wasn't explicitly programmed for. It is sometimes seen as a subset of artificial intelligence. If you have some prior experience with traditional programming, you'll know that building a piece of software involves explicitly providing a machine with an unambiguous set of instructions to be executed sequentially or in parallel in order to perform a certain task. This works quite well if the purpose of your software is to calculate the commission on a purchase, or display a dashboard to the user, or read and write data to an attached device. These types of problems usually involve a finite number of well-defined steps in order to perform their task.

However, what if the task of your software is to recognize whether a picture contains a cat? Even if you build software that is able to correctly identify the shape of a cat in a few specific sample pictures (e.g., by checking whether some specific pixels present in your sample pictures are in place), that software will probably fail at performing its task if you provide it with different pictures of cats—or even slightly edited versions of your own sample images. And what if you have to build software to detect spam? Sure, you can probably still do it with traditional programming—you can, for instance, build a huge list of words or phrases often found in spam emails—but if your software is provided with words similar to those on your list but that are not present on your list, then it will probably fail its task.
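To make this brittleness concrete, here is what such a naive rule-based spam detector might look like. The phrase list and messages are made up for illustration: the filter only catches exact matches, so a trivially obfuscated email slips right past it.

```python
# A naive rule-based spam detector in the spirit described above.
# The phrase list is invented for illustration only.
SPAM_PHRASES = ["win a prize", "free money", "act now"]

def is_spam(email_text: str) -> bool:
    """Flag an email as spam if it contains any known spam phrase."""
    text = email_text.lower()
    return any(phrase in text for phrase in SPAM_PHRASES)

print(is_spam("Click here to win a prize!"))   # True: exact match
print(is_spam("Click here to w1n a pr1ze!"))   # False: same scam, but not on the list
```

The second message is obviously the same scam to a human reader, yet the hand-written rules miss it; a learned model, trained on many examples, has a much better chance of generalizing to variants it has never seen.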
The latter category includes tasks that traditionally humans have been considered better at performing than machines: a machine is a million times faster than a human at executing a finite sequence of steps and even solving advanced math problems, but it'll shamefully fail (at least with traditional programming) at telling whether a certain picture depicts a cat or a traffic light. Human brains are usually better than machines at these tasks because they have been exposed for several years to many examples and sense-based experiences. We can tell within a fraction of a second whether a picture contains a cat even without having full experience of all the possible breeds and their characteristics and all of their possible poses. That's because we've probably seen other cats before, and we can quickly perform a process of mental classification that labels the subject of a picture as something that we have already seen in the past. In other words, our brains have been trained, or wired, over the years to become very good at recognizing patterns in a fuzzy world, rather than quickly performing a finite sequence of complex but deterministic tasks in a virtual world. Machine learning is the set of techniques that tries to mimic the way our brains perform tasks—by trial and error until we can infer patterns out of the acquired experience, rather than by an explicit declaration of steps.

It's worth providing a quick disambiguation between machine learning and artificial intelligence (AI). Although the two terms are often used as synonyms today, machine learning is a set of techniques where a machine can be instructed to solve problems it wasn't specifically programmed for through exposure to (usually many) examples. Artificial intelligence is a wider classification that includes any machine or algorithm good at performing tasks usually humans are better at—or, according to some, tasks that display some form of human-like intelligence.
The definition of AI is actually quite blurry—some may argue whether being able to detect an object in a picture or to find the shortest path between two cities is really a form of "intelligence"—and machine learning may be just one possible tool for achieving it (expert systems, for example, were quite popular in the
early 2000s). Therefore, throughout this book I'll usually talk about the tool (machine learning algorithms) rather than the philosophical goal (artificial intelligence) that such algorithms may be supposed to achieve. Before we dive further into the nuts and bolts of machine learning, it's probably worth providing a bit of context and history to understand how the discipline has evolved over the years and where we are now.
1.1 History

Although machine learning has gone through a very sharp rise in popularity over the past decade, it has probably been around as long as digital computers have. The dream of building a machine that could mimic human behavior and features with all of their nuances is even older than computer science itself. However, the discipline went through a series of ups and downs over the second half of the past century before experiencing today's explosion. Today's most popular machine learning techniques leverage a concept first theorized in 1949 by Donald Hebb [1]. In his book The Organization of Behavior, he theorized that neurons in a human brain work by either strengthening or weakening their mutual connections in response to stimuli from the outer environment. Hebb wrote, "When one cell repeatedly assists in firing another, the axon of the first cell develops synaptic knobs (or enlarges them if they already exist) in contact with the soma of the second cell." Such a model (fire together, wire together) inspired research into how to build an artificial neuron that could communicate with other neurons by dynamically adjusting the weight of its links to them (synapses) in response to the experience it gathers. This concept is the theoretical foundation behind modern-day neural networks. One year later, in 1950, the famous British mathematician (and father of computer science) Alan Turing came up with what is probably the first known definition of artificial intelligence. He proposed an
experiment where a human was asked to have a conversation with "someone/something" hidden behind a screen. If by the end of the conversation the subject couldn't tell whether he/she had talked to a human or a machine, then the machine would have passed the "artificial intelligence" test. Such a test is today famously known as the Turing test. In 1951, Christopher Strachey wrote a program that could play checkers, and Dietrich Prinz, one that could play chess. Later improvements during the 1950s led to the development of programs that could effectively challenge an amateur player. Such early developments led to games often being used as a standard benchmark for measuring the progress of machine learning—up to the day when IBM's Deep Blue beat Kasparov at chess and AlphaGo beat Lee Sedol at Go. In the meantime, the advent of digital computers in the mid-1950s led to a wave of optimism in what became known as symbolic AI. A few researchers recognized that a machine that could manipulate numbers could also manipulate symbols, and if symbols were the foundation of human thought, then it would have been possible to design thinking machines. In 1955, Allen Newell and the future Nobel laureate Herbert A. Simon created the Logic Theorist, a program that could prove mathematical theorems through inference given a set of logic axioms. It managed to prove 38 of the first 52 theorems of Bertrand Russell's Principia Mathematica. Such theoretical background led to early enthusiasm among researchers. It caused a boost of optimism that culminated in a workshop held in 1956 at Dartmouth College [2], where some academics predicted that machines as intelligent as humans would have been available within one generation, and millions of dollars were provided to make the vision come true. This conference is today considered the foundation of artificial intelligence as a discipline. In 1957, Frank Rosenblatt designed the perceptron.
He applied Hebb's neural model to design a machine that could perform image recognition. The software was originally designed for the IBM 704 and installed on a custom-built machine called the Mark 1 perceptron. Its main goal
was to recognize features from pictures—facial features in particular. A perceptron functionally acts like a single neuron that can learn (i.e., adjust its synaptic weights) from provided examples and make predictions or guesses on examples it had never seen before. The mathematical procedure at the basis of the perceptron (logistic regression) is the building block of neural networks, and we'll cover it later in this chapter. Although the direction was definitely a good one, the network itself was relatively simple, and the hardware available in 1957 definitely couldn't allow the marvels possible with today's machines. Whenever you wonder whether a Raspberry Pi is the right choice for running machine learning models, keep in mind that you're handling a machine almost a million times more powerful than the one used by Frank Rosenblatt to train the first model that could recognize a face [4, 5]. The disappointment after the perceptron experiment led to a drop in interest in the field of machine learning as we know it today (which only rose again during the late 1990s, when improved hardware started to show the potential of the theory), while more focus was put on other branches of artificial intelligence. The 1960s and 1970s saw in particular a rise in reasoning as search, an approach where the problem of finding a particular solution was basically translated into a problem of searching for paths in connected graphs that represented the available knowledge. Finding how "close" the meanings of two words were became a problem of finding the shortest path between the two associated nodes within a semantic graph. Finding the best move in a game of chess became a problem of finding the path with minimum cost or maximum profit in the graph of all the possible scenarios. Proving whether a theorem was true or false became a problem of building a decision tree out of its propositions plus the relevant axioms and finding a path that could lead either to a true or false statement.
The progress in these areas led to impressive early achievements, such as ELIZA, today considered the first example of a chatbot. Developed at MIT between 1964 and 1966, it used to mimic a human conversation, and it may have tricked the users (at least for the first few interactions)
that there was a human on the other side. In reality, the algorithm behind the early versions was relatively simple, as it simply repeated or reformulated some of the sentences of the user, posing them back as questions (to many it gave the impression of talking to a shrink), but keep in mind that we're still talking about a few years before the first video game was even created. Such achievements led to a lot of hyper-inflated optimism in AI for the time. A few examples of this early optimism:

• 1958: "Within ten years a digital computer will be the world's chess champion, and a digital computer will discover and prove an important new mathematical theorem" [6].

• 1965: "Machines will be capable, within twenty years, of doing any work a man can do" [7].

• 1967: "Within a generation the problem of creating 'artificial intelligence' will substantially be solved" [8].

• 1970: "In from three to eight years we will have a machine with the general intelligence of an average human being" [9].
Of course, things didn't go exactly that way. Around the mid-1970s, most researchers realized that they had definitely underestimated the problem. The main issue was, of course, with the computing power of the time. By the end of the 1960s, researchers had realized that training a network of perceptrons with multiple layers led to better results than training a single perceptron, and by the mid-1970s, back-propagation (the building block of how networks "learn") had been theorized. In other words, the basic shape of a modern neural network was already theorized in the mid-1970s. However, training a neural-like model required a lot of CPU power to perform the calculations needed to converge toward an optimal solution, and such hardware power wouldn't have been available for the next 25–30 years.
The reasoning as search approach in the meantime faced the combinatorial explosion problem. Transforming a decision process into a graph search problem was OK for playing chess, proving a geometric theorem, or finding synonyms of words, but more complex real-world problems would have easily resulted in humongous graphs, as their complexity would grow exponentially with the number of inputs—that relegated AI mostly to toy projects within research labs rather than real-world applications. Finally, researchers learned what became known as Moravec's paradox: it's really easy for a deterministic machine to prove a theorem or solve a geometry problem, but much harder to perform more "fuzzy" tasks such as recognizing a face or walking around without bumping into objects. Research funding drained when results failed to materialize.

AI experienced a resurgence in the 1980s in the form of expert systems. An expert system is a piece of software that answers questions or interprets the content of a text within a specific domain of knowledge, applying inference rules derived from the knowledge of human experts. The formal representation of knowledge through relational and graph-based databases introduced in the late 1970s led to this new revolution in AI, which focused on how to best represent human knowledge and how to infer decisions from it. Expert systems went through another huge wave of optimism followed by another crash. While they were relatively good at providing answers to simple domain-specific questions, they were only as good as the knowledge provided by the human experts. That made them very expensive to maintain and update and very prone to errors whenever an input looked slightly different from what was provided in the knowledge base. They were useful in specific contexts, but they couldn't be scaled up to solve more general-purpose problems. The whole framework of logic-based AI came under increasing criticism during the 1990s.
Many researchers argued that a truly intelligent machine should have been designed bottom-up rather than top-down. A machine can't make logical inference about rain and umbrellas if it doesn't know what those things or concepts actually mean or look like—in other words, if it can't perform some form of human-like classification based both on intuition and acquired experience.
Such reflections gradually led to a new interest in the machine learning side of AI, rather than the knowledge-based or symbolic approaches. Also, the hardware in the late 1990s was much better than what was available to the MIT researchers in the 1960s, and simple tasks of computer vision that had proved incredibly challenging at the time, like recognizing handwritten letters or detecting simple objects, could now be solved thanks to Moore's law, which states that the number of transistors in a chip doubles approximately every 18 months, and thanks to the Web, with the huge amount of data it made available over the years and the increased ease of sharing this data. Today the neural network is a ubiquitous component in machine learning and AI in general. It's important to note, however, that other approaches may still be relevant in some scenarios. Finding the quickest path by bike between your home and your office is still largely a graph search problem. Intent detection in unstructured text still relies on language models. Other problems, like some games or real-world simulations, may employ genetic algorithms. And some specific domains may still leverage expert systems. However, even when other algorithms are employed, neural networks nowadays often represent the "glue" that connects all the components. Different algorithms or networks are nowadays often modular blocks connected together through data pipelines. Deep learning has become increasingly popular over the past decade, as better hardware made it possible to feed more data to bigger networks with more layers. Deep learning, under the hood, is the process of solving some of the common problems with earlier networks (like overfitting) by adding more neurons and more layers. Usually the accuracy of a network increases when you increase its number of layers and nodes, as the network will be better at spotting patterns in non-linear problems. However, deep learning may be plagued by some issues as well.
One of them is the vanishing gradient problem, where gradients slowly shrink as they pass through more and more layers. Another more concrete issue is related to its environmental impact: while throwing more data and more neurons at a network and running more training iterations seems to make the network more accurate, it also represents a very power-hungry solution that cannot be sustainable for long-term growth.
Chapter 1
Introduction to Machine Learning
1.2 Supervised and unsupervised learning
Now that the difference between artificial intelligence and machine learning is clear, and we have some context on how we got to where we are today, let's shift our focus to machine learning and its two main "learning" categories: supervised and unsupervised learning.
• We define as supervised learning the set of algorithms where a model is trained on a dataset that includes both example input data and the associated expected outputs.
• In unsupervised learning, on the other hand, we train models on datasets that don't include the expected outputs. In these algorithms, we expect the model to "figure out" patterns by itself.
When it comes to supervised learning, training a model usually consists of calculating a function y = f(x) that maps a given vector of inputs x to a vector of outputs y such that the mean error between the predicted and the expected values is minimized. Some applications of supervised learning algorithms include
• Given a training set containing one million pictures of cats and one million pictures that don't feature cats, build a model that recognizes cats in previously unseen pictures. In this case, the training set will usually include a True/False label to tell whether the picture includes a cat. This is usually considered a classification problem—that is, given some input values and their labels (together called the training set), you want your model to predict the correct class, or label—for example, "does/does not contain a cat." Or, if you provide the model with many examples of emails labelled as spam/not spam, you may train your classifier to detect spam in previously unseen emails.
• Given a training set containing the features of a large list of apartments in a city (size, location, construction year, etc.) together with their prices, build a model that can predict the value of a new apartment on the market. This is usually considered a regression problem—that is, given a training set, you want to train a model that predicts the best numeric approximation for a newly provided input in, for example, dollars, meters, or kilograms.
Unsupervised learning, on the other hand, is often used to solve problems whose goal is to find the underlying structure, distribution, or patterns in the data. Since no expected labels are provided, these problems come with no exact/correct answers, nor a comparison against expected values. Some examples of unsupervised learning problems include
• Given a large list of customers on an ecommerce website with the relevant input features (age, gender, address, list of purchases in the past year, etc.), find the optimal way to segment your customer base in order to plan several advertisement campaigns. This is usually considered a clustering problem—that is, given a set of inputs, find the best way to group them together.
• Given a user on a music streaming platform with their relevant features (age, gender, list of tracks listened to in the past month, etc.), build a model that can recommend user profiles with similar musical taste. This is usually considered a recommender system, or association problem—that is, a model that finds the nearest neighbors to a particular node.
Finally, there can be problems that sit halfway between supervised and unsupervised learning. Think of a large production database where some images are labelled (e.g., “cat,” “dog,” etc.) but some aren’t—for example, because it’s expensive to hire enough humans to manually label all the records or because the problem is very large and providing a full picture of all the possible labels is hard. In such cases, you may want to rely on hybrid implementations that use supervised learning to learn from the available labelled data and leverage unsupervised learning to find patterns in the unlabelled data. In the rest of the book, we will mainly focus on supervised learning, since this category includes most of the neural architectures in use today, as well as all the regression problems. Some popular unsupervised learning algorithms will, however, be worth a mention, as they may be often used in symbiosis with supervised algorithms.
1.3 Preparing your tools
After so much talk about machine learning, let's introduce the tools that we'll be using through our journey. You won't need a Raspberry Pi (yet) during this chapter: as we cover the algorithms and the software tools used for machine learning, your own laptop will do the job. Through the next sections, I'll assume that you have some knowledge/experience with
• Any programming language (if you have experience with Python, even better). Python has become the most popular choice for machine learning over the past few years, but even if you don't have much experience with it, don't worry—it's relatively simple, and I'll try to comment the code as much as possible.
• High school (or higher) level of math. If you are familiar with calculus, statistics, and linear algebra, that's even better. If not, don't worry. Although some calculus and linear algebra concepts are required to grasp how machine learning works under the hood, I'll try not to dig too much into the theory, and whenever I mention gradients, tensors, or recall, I'll make sure to focus more on what they intuitively mean rather than their formal definition alone.
1.3.1 Software tools
We'll be using the following software tools through our journey:
• The Python programming language (version 3.6 or higher).
• TensorFlow, probably the most popular framework nowadays for building, training, and querying machine learning models.
• Keras, a very popular library for neural networks and regression models that easily integrates on top of TensorFlow.
• numpy and pandas, the most commonly used Python libraries for numeric manipulation and data analysis, respectively.
• matplotlib, a Python library for plotting data and images.
• seaborn, a Python library often used for statistical data visualization.
• jupyter, a very popular solution for prototyping Python code through notebooks (basically the de facto standard when it comes to data science in Python).
• git, which we'll use to download the sample datasets and notebooks from GitHub.
1.3.2 Setting up your environment
• Download and install git on your system, if it's not available already.
• Download and install a recent version of Python from https://python.org/downloads, if it's not already available on your system. Make sure that you use Python 3 (Python 2 is deprecated). Open a terminal and check that both the python and pip commands are available.
• (Optional) Create a new Python virtual environment. A virtual environment allows you to keep your machine learning setup separate from the main Python installation, without messing with any system-wide dependencies, and it also allows you to install Python packages in a non-privileged user space. You can skip this step if you prefer to install the dependencies system-wide (although you may need root/administrator privileges).
# Create a virtual environment under your home folder
python -m venv $HOME/venv

# Activate the environment
cd $HOME/venv
source bin/activate

# Whenever you are done, run 'deactivate'
# to go back to your standard environment
deactivate
• Install the dependencies (it may take a while depending on your connectivity and CPU power):
pip install tensorflow
pip install keras
pip install numpy
pip install pandas
pip install matplotlib
pip install jupyter
pip install seaborn
• Download the "mlbook-code" repository, containing some of the datasets and code snippets that we'll be using through this book:

git clone https://github.com/BlackLight/mlbook-code

We'll refer to the directory where you have cloned the repository in the next sections.
• Start the Jupyter server:

jupyter notebook

• Open http://localhost:8888 in your browser. You should see a login screen like the one shown in Figure 1-1. You can decide whether to authenticate using the token or set a password.
• Select a folder where you'd like to store your notebooks, and create your first notebook. Jupyter notebooks are lists of cells; each cell can contain Python code, markdown elements, images, and so on. Start to get familiar with the environment, try to run some Python commands, and make sure that things work.
Now that all the tools are ready, let’s get our hands dirty with some algorithms.
Figure 1-1. Jupyter login screen at http://localhost:8888
Figure 1-2. Distribution of the house prices as a function of their size
1.4 Linear regression
Linear regression is the first machine learning algorithm we'll encounter on our journey, as well as the most important. As we'll see, most supervised machine learning algorithms are variations of linear regression or applications of linear regression at scale. Linear regression is good for solving problems where you have some input data represented by n dimensions and you want to learn from its distribution in order to make predictions on future data points—for example, to predict the price of a house in a neighborhood or of a stock given a list of historical data. Linear regression is also extensively used as a statistical tool in finance, logistics, and economics to predict the price of a commodity, the demand in a certain period, or other macroeconomic variables. It's the building block of other types of regression—like logistic regression, which in turn is the building block of neural networks. A regression model is called a simple regression model if the input data is represented by a single variable (n = 1), or multivariate regression if it operates on input data defined by multiple dimensions. It's called linear regression if its output is a line that best approximates the input data. Other types of regression exist as well—for instance, you'll have a quadratic regression if you try to fit a parabola instead of a line through your data, a cubic regression if you try to fit a cubic polynomial curve through your data, and so on. We'll start with simple linear regression and expand its basic mechanism to the other regression problems.
1.4.1 Loading and plotting the dataset
To get started, create a new Jupyter notebook in your notebooks folder and load datasets/house-size-price-1.csv from the repository you have cloned. It is a CSV file that contains a list of house prices (in thousands of dollars) as a function of their size (in square meters) in a certain city. Let's suppose for now that the size is the only parameter that we've got, and we want to create a model trained on this data that predicts the price of a new house given its size. The first thing you should do, before jumping into defining any machine learning model, is to visualize your data and try to understand what's the best model to use. This is how we load the CSV using pandas and visualize it using matplotlib:
import pandas as pd
import matplotlib.pyplot as plt

# Download the CSV file from GitHub
csv_url = ('https://raw.githubusercontent.com/' +
           'BlackLight/mlbook-code/master/' +
           'datasets/house-size-price-1.csv')

data = pd.read_csv(csv_url)
columns = data.columns

# The first column contains the size in m2
size = data[columns[0]]

# The second column contains the price
# in thousands of dollars
price = data[columns[1]]

# Create a new figure and plot the data
fig = plt.figure()
plot = fig.add_subplot()
plot.set_xlabel(columns[0])
plot.set_ylabel(columns[1])
points = plot.scatter(size, price)
After running the content of your cell, you should see a graph like the one pictured in Figure 1-2. You'll notice that the data is a bit scattered, but it could still be approximated well enough by fitting a straight line through it. Our goal is now to find the line that best approximates our data.
1.4.2 The idea behind regression
Let's put our notebook aside for a moment and think about which characteristics such a function should have. First, it must be a line; therefore, it must have a form like this:

\[ h_\theta(x) = \theta_0 + \theta_1 x \tag{1.1} \]

In Equation 1.1, x̄ denotes the input data, excluding the output labels. In the case of the house size-price model, we have one input variable (the size) and one output variable (the price); therefore, both the input and output vectors have unitary size, but other models may have multiple input variables and/or multiple output variables. θ0 and θ1 are instead the numeric coefficients of the line. In particular, θ0 tells us where the line crosses the y axis, and θ1 tells us the direction of the line and how "steep" it is—it's often called the slope of the line. hθ(x) is instead the function that our model will use for predictions, based on the values of the vector θ̄—also called the weights of the model. In our problem, the inputs x̄ and the expected outputs ȳ are provided through the training set; therefore, the linear regression problem is a problem of finding the θ̄ parameters in the preceding equation such that hθ(x) is a good approximation of the linear dependency between x̄ and ȳ. hθ(x) is often denoted as the hypothesis function, or simply the model. The generic formula for a single-variable model of order n (linear for n = 1, quadratic for n = 2, etc.) will be

\[ h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \dots + \theta_n x^n = \sum_{i=0}^{n} \theta_i x^i \tag{1.2} \]
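To make the hypothesis function concrete, here is a minimal Python sketch (the `hypothesis` helper is our own illustrative name, not code from this book's repository) that evaluates a polynomial hypothesis of arbitrary order for a scalar input:

```python
def hypothesis(theta, x):
    # h_theta(x) = theta_0 + theta_1*x + ... + theta_n*x^n
    return sum(t * x**i for i, t in enumerate(theta))

# Linear case (n = 1), theta = [1, 2]: h(x) = 1 + 2x
print(hypothesis([1, 2], 3))      # 7
# Quadratic case (n = 2), theta = [0, 0, 1]: h(x) = x^2
print(hypothesis([0, 0, 1], 4))   # 16
```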
Also, note that the "bar" on top of a symbol in this book will denote a vector. A vector is a fixed-size list of numeric elements, so θ̄ is actually a more compact way to write [θ0, θ1] or [θ0 … θn]. So how can we formalize the intuitive concept of "good enough linear approximation" into an algorithm? The intuition is to choose the θ̄ parameters such that their associated hθ(x) function is "close enough" to the provided samples y for the given values of x. More formally, we want to minimize the mean squared error between the sampled values y and the predicted values hθ(x) over all the m data points provided in the training set—if the error between the predicted and the actual values is low, then the model is performing well:

\[ \min_{\theta_0, \theta_1} \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x_i) - y_i \right)^2 \tag{1.3} \]
Let’s rephrase the argument of the preceding formula as a function of the parameters q : J (q ) =
20
2 1 m hq ( xi ) - yi ) ( å 2m i =1
(1.4)
The function J is also called the cost function (or loss function), since it expresses the cost (or, in this case, the approximation error) of the line associated with a particular selection of the θ̄ parameters. Finding the best linear approximation for your data is therefore a problem of finding the values of θ̄ that minimize the preceding cost function (i.e., the mean of the squared errors between samples and predictions). Note the difference between hθ(x) and J(θ̄): the former is our model hypothesis, that is, the prediction function of the model expressed by its vector of parameters θ̄ and with x as a variable. J(θ̄) is instead the cost function, whose variables are the parameters θ̄, and our goal is to find the values of θ̄ that minimize this function in order to calculate the best hθ(x). If we want to start formalizing the procedure, we can say that the problem of finding the optimal regression model can be expressed as follows:
• Start with a set of initial parameters θ̄ = [θ0, θ1, …, θn], with n being the order of your regression (n = 1 for linear, n = 2 for quadratic, etc.).
• Use those values to formulate a hypothesis hθ(x) as shown in Equation 1.2.
• Calculate the cost function J(θ̄) associated with that hypothesis as shown in Equation 1.4.
• Keep changing the values of θ̄ until we converge on a point that minimizes the cost function.
Now that it’s clear how a regression algorithm is modelled and how to measure how good it approximates the data, let’s cover how to implement the last point in the preceding list—that is, the actual “learning” phase.
1.4.3 Gradient descent
Hopefully so many mentions of "error minimization" have rung a bell if you have some memory of calculus! The differential (or derivative) is the mathematical instrument leveraged in most minimization/maximization problems. In particular:
• The first derivative of a function tells us whether that function is increasing or decreasing around a certain point, or whether it is "still" (i.e., the point is a local minimum/maximum or an inflection point). It can be geometrically visualized as the slope of the line tangent to the function at a certain point. If we name f′(x) the first derivative of a function f(x) (with x ∈ ℝ for now), then its value at a point x0 will be (assuming that the function is differentiable in x0)

\[ f'(x_0) \begin{cases} > 0 & \text{if the curve is increasing} \\ < 0 & \text{if the curve is decreasing} \\ = 0 & \text{if } x_0 \text{ is a min/max/inflection point} \end{cases} \tag{1.5} \]

• The second derivative of a function tells us the "concavity" of that function around a certain point, that is, whether the curve is facing "up" or "down" around that point. The second derivative f″(x) around a given point x0 will be

\[ f''(x_0) \begin{cases} > 0 & \text{if the curve faces up} \\ < 0 & \text{if the curve faces down} \\ = 0 & \text{if } x_0 \text{ is an inflection point} \end{cases} \tag{1.6} \]
By combining these two instruments, we can find out, given a certain point on a surface, in which direction the minimum of that function lies. In other words, minimizing the cost function J(θ̄) is a problem of finding a vector of values θ̄* = [θ0, θ1, …, θn] such that the first derivative of J at θ̄* is zero (or close enough to zero) and its second derivative is positive (i.e., the point is a local minimum). In most of the algorithms in this book, we won't actually need the second derivative (many of today's popular machine learning models are built around convex cost functions, i.e., cost functions with a single point of minimum), but applications with higher polynomial models may leverage the idea of concavity behind the second derivative to tell whether a point is a minimum, a maximum, or an inflection. This intuition works for the case with a single variable. However, J is a function of a vector of parameters θ̄, whose length is 2 for linear regression, 3 for quadratic, and so on. For multivariate functions, the concept of gradient is used instead of the derivative used for univariate functions. The gradient of a multivariate function (usually denoted by the symbol ∇) is the vector of the partial derivatives (conventionally denoted by the ∂ symbol) of that function calculated against each of the variables. In the case of our cost function, its gradient will be

\[ \nabla J(\bar\theta) = \begin{bmatrix} \dfrac{\partial}{\partial \theta_0} J(\bar\theta) \\ \vdots \\ \dfrac{\partial}{\partial \theta_n} J(\bar\theta) \end{bmatrix} \tag{1.7} \]
In practice, a partial derivative is calculated by differentiating a multivariate function while treating all of its variables except one as constants.
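A partial derivative can also be approximated numerically, which is a handy way to build intuition (a toy sketch on a made-up function; `partial` is our own helper name):

```python
def f(t0, t1):
    # Made-up bivariate function: f = t0^2 + 3*t0*t1
    return t0**2 + 3 * t0 * t1

def partial(func, args, i, h=1e-6):
    # Finite-difference approximation of d(func)/d(args[i]),
    # keeping all the other variables constant
    bumped = list(args)
    bumped[i] += h
    return (func(*bumped) - func(*args)) / h

# Analytically df/dt0 = 2*t0 + 3*t1, so at (1, 2) it is 8
print(round(partial(f, (1.0, 2.0), 0), 3))   # 8.0
# Analytically df/dt1 = 3*t0, so at (1, 2) it is 3
print(round(partial(f, (1.0, 2.0), 1), 3))   # 3.0
```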
Figure 1-3. Typical shape of the cost function for a linear regression model: a paraboloid in a 3D space

The gradient vector indicates the direction in which an n-dimensional curve is increasing around a certain point and "how fast" it is increasing. In the case of linear regression with one input variable, we have seen in Equation 1.4 that the cost function is a function of two variables (θ0 and θ1) and that it is a quadratic function. In fact, by combining Equations 1.1 and 1.4, we can derive the cost function for linear regression with univariate input:

\[ J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( \theta_0 + \theta_1 x_i - y_i \right)^2 \tag{1.8} \]
This surface can be represented in a 3D space as a paraboloid (the bowl shape that you get if you rotate a parabola 360 degrees around its axis, as shown in Figure 1-3). That puts us in a relatively lucky spot. Just like a parabola, a paraboloid has only one minimum—that is, only one point where its gradient vector is zero, and that point also happens to be the global minimum. It means that if we start with some random values for θ̄ and then start "walking" in the opposite direction of the gradient at that point, we should eventually end up in the optimal point—that is, the vector of parameters θ̄* that we can plug into our hypothesis function hθ(x) to get good predictions. Remember that, given a point on a surface, the gradient vector tells you in which direction the function goes "up." If you go in the opposite direction, you will be going down. And if your surface has the shape of a bowl, if you keep going down, from wherever you are on the surface, you will eventually get to the bottom. If you instead use more complex types of regression (quadratic, cubic, etc.), you may not always be so lucky—you could get stuck in a local minimum, which does not represent the overall minimum of the function. So, assuming that we are happy as soon as we start converging toward a minimum, the gradient descent algorithm can be expressed as a procedure where we start by picking random values for θ̄, and then at each step k we update these values through the following formula:
\[ \theta_i^{(k+1)} = \theta_i^{(k)} - \alpha \frac{\partial}{\partial \theta_i} J\left(\theta_0^{(k)}, \theta_1^{(k)}\right) \quad \text{for } i = 0, 1 \tag{1.9} \]

Or, in vectorial form:

\[ \bar\theta^{(k+1)} = \bar\theta^{(k)} - \alpha \nabla J\left(\bar\theta^{(k)}\right) \tag{1.10} \]
α is a parameter between zero and one known as the learning rate of the machine learning model, and it expresses how fast the model should learn from new data—in other words, how big the “leaps” down the direction of the gradient vector should be. A large value of α would result in a model that could learn fast at the beginning, but that could overshoot the minimum or “miss the stop”—at some point, it could land near the minimum of the cost function and could miss it by taking a longer step,
resulting in some “bounces” back and forth before converging. A small value of α, on the other hand, may be slow at learning at the beginning and could take more iterations, but it’s basically guaranteed to converge without too many bounces around the minimum. To use a metaphor, performing learning through gradient descent is like pushing a ball blindfolded down a valley. First, you have to figure out in which direction the bottom of the valley is, using the direction of the gradient vector as a compass. Then, you have to find out how much force you want to apply to the ball to get it to the bottom. If you push it with a lot of force, it may reach the bottom faster, but it may go back and forth a few times before settling. Instead, if you only let gravity do its work, the ball may take longer to get to the bottom, but once it’s there, it’s unlikely to swing too much. In most of today’s applications, the learning rate is dynamic: you may usually want a higher value of α in the first phase (when your model doesn’t know much yet about the data) and lower it toward the end (when your model has already seen a significant number of data points and/or the cost function is converging). Some popular algorithms nowadays also perform some kind of learning rate shuffling in case they get stuck. We may want to also set an exit condition for the algorithm: in case it doesn’t converge on a vector of parameters that nullifies the cost function, for example, we may want to still exit if the gradient of the cost function is close enough to zero, or there hasn’t been a significant improvement from the previous step, or the corresponding model is already good enough at approximating our problem. By combining Equations 1.8 and 1.9 and applying the differentiation rules, we can derive the exact update steps for θ0 and θ1 in the case of linear regression:
\[ \theta_0^{(k+1)} = \theta_0^{(k)} - \frac{\alpha}{m} \sum_{i=1}^{m} \left( \theta_0^{(k)} + \theta_1^{(k)} x_i - y_i \right) \tag{1.11} \]

\[ \theta_1^{(k+1)} = \theta_1^{(k)} - \frac{\alpha}{m} \sum_{i=1}^{m} \left( \theta_0^{(k)} + \theta_1^{(k)} x_i - y_i \right) x_i \tag{1.12} \]
So the full algorithm for linear regression can be summarized as follows:
1. Pick random values for θ0 and θ1.
2. Keep updating θ0 and θ1 through, respectively, Equations 1.11 and 1.12.
3. Terminate either after performing a preset number of training iterations (often called epochs) or when convergence is achieved (i.e., no significant improvement has been measured on a certain step compared to the step before).
There are many tools and libraries that can perform efficient regression, so it's uncommon that you will have to implement the algorithm from scratch. However, since the preceding steps are known, it's relatively simple to write a univariate linear regression algorithm in Python with numpy alone:
import numpy as np

def gradient_descent(x, y, theta, alpha):
    m = len(x)
    new_theta = np.zeros(2)

    # Perform the gradient descent on theta
    for i in range(m):
        new_theta[0] += theta[0] + theta[1]*x[i] - y[i]
        new_theta[1] += (theta[0] + theta[1]*x[i] - y[i]) * x[i]

    return theta - (alpha/m) * new_theta

def train(x, y, steps, alpha=0.001):
    # Initialize theta randomly
    theta = np.random.randint(low=0, high=10, size=2)

    # Perform the gradient descent for the given number of steps
    for i in range(steps):
        theta = gradient_descent(x, y, theta, alpha)

    # Return the linear function associated to theta
    return lambda x: theta[0] + theta[1] * x
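As a quick sanity check of the update rule in Equations 1.11 and 1.12, you can train on points drawn from a known line (here a made-up noiseless dataset following y = 2x + 1) and verify that the learned parameters converge to it; below is a vectorized numpy equivalent of the same updates:

```python
import numpy as np

x = np.linspace(-1, 1, 50)
y = 2 * x + 1   # made-up data generated from theta = [1, 2]

theta = np.zeros(2)
alpha = 0.5

for _ in range(500):
    error = theta[0] + theta[1] * x - y
    # Same updates as Equations 1.11 and 1.12, vectorized
    theta = theta - (alpha / len(x)) * np.array([error.sum(),
                                                 (error * x).sum()])

print(np.round(theta, 3))   # should be close to [1, 2]
```

Note that the learning rate here (0.5) is much larger than the default in the snippet above (0.001): with normalized, small-range inputs like these, larger steps converge much faster without overshooting.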
1.4.4 Input normalization
Let's pick up our notebook where we left it. We should now have a clear idea of how to train a regression model to predict the price of a house given its size. We will use TensorFlow+Keras for defining and training the model. Before proceeding, however, it's important to say a few words about input normalization (or feature scaling). It's very important that your model is robust even if some data is provided in different units than those used in the training set, or if some arbitrary constant is added to or multiplied into your inputs. That's why it's important to normalize your inputs before feeding them to any machine learning model. Not only that: if the inputs are well distributed within a specific range centered around the origin, the model will be much faster at converging than if you provide raw inputs with no specific range. Worse, a non-normalized training set can often result in a cost function that doesn't converge at all.
Input normalization is usually done by applying the following transformation:

\[ \hat{x}_i = \frac{x_i - \mu}{\sigma} \quad \text{for } i = 1, \dots, m \tag{1.13} \]

where xi is the i-th element of your m-long input vector, μ is the arithmetic mean of the vector x̄ of the inputs, and σ is its standard deviation:

\[ \mu = \frac{1}{m} \sum_{i=1}^{m} x_i \tag{1.14} \]

\[ \sigma = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left( x_i - \mu \right)^2} \tag{1.15} \]

By applying Equation 1.13 to our inputs, we basically transpose the input values around zero and group most of them within the [−σ, σ] range. When predicting values, we instead want to denormalize the provided output, which can easily be done by deriving xi from Equation 1.13:

\[ x_i = \sigma \hat{x}_i + \mu \tag{1.16} \]
It’s quite easy to write these functions in Python and insert them in our notebook. First, get the dataset stats using the pandas describe method: dataset_stats = data.describe()
Then define the functions to normalize and denormalize your data:

def normalize(x, stats):
    return (x - stats['mean']) / stats['std']

def denormalize(x, stats):
    return stats['std'] * x + stats['mean']

norm_size = normalize(size, dataset_stats['size'])
norm_price = normalize(price, dataset_stats['price'])
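A quick round-trip check of the two functions (a self-contained sketch with made-up numbers; note that pandas' describe reports the sample standard deviation, computed with ddof=1, which differs slightly from Equation 1.15 but is harmless for normalization purposes):

```python
import pandas as pd

def normalize(x, stats):
    return (x - stats['mean']) / stats['std']

def denormalize(x, stats):
    return stats['std'] * x + stats['mean']

data = pd.DataFrame({'size': [50.0, 80.0, 110.0, 140.0]})
stats = data.describe()

norm = normalize(data['size'], stats['size'])
restored = denormalize(norm, stats['size'])

# Normalized values are centered around zero...
print(abs(norm.mean()) < 1e-9)                        # True
# ...and denormalizing recovers the original values
print((restored - data['size']).abs().max() < 1e-9)   # True
```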
1.4.5 Defining and training the model
Defining and training a linear regression model with TensorFlow+Keras is quite easy:

from tensorflow.keras.experimental import LinearModel

model = LinearModel(units=1,
                    activation='linear',
                    dtype='float32')

model.compile(optimizer='rmsprop',
              loss='mse',
              metrics=['mse'])

history = model.fit(norm_size, norm_price,
                    epochs=1000, verbose=0)
There are quite a few things happening here:
• First, we define a LinearModel with a single output unit (units=1), a linear activation function (i.e., the output value of the model is returned directly, without being transformed by a non-linear output function), and the float32 numeric type.
• Then, we compile the model, making it ready to be trained. The optimizer in Keras does many things. A deep understanding of the optimizers would require a dedicated chapter, but for the sake of brevity, we will quickly cover what they do as we use them. rmsprop initializes the learning rate and gradually adjusts it over the training iterations as a function of the recent gradients [10]. By default, rmsprop is initialized with learning_rate=0.001. You can, however, try to tweak it and see how/if it affects your model:
from tensorflow.keras.optimizers import RMSprop

rmsprop = RMSprop(learning_rate=0.005)
model.compile(optimizer=rmsprop,
              loss='mse',
              metrics=['mse'])
• Other common optimizers include
– SGD, or stochastic gradient descent, which implements the gradient descent algorithm described in Equation 1.9 with a few optimizations and tweaks, such as learning rate decay and Nesterov momentum [11]
– adam, an algorithm for first-order gradient-based optimization that has recently gained quite some momentum, especially in deep learning [12]
– nadam, which implements support for Nesterov momentum on top of the adam algorithm [13]
• Feel free to experiment with different optimizers and different learning rates to see how they affect the performance of your model.
• The loss parameter defines the loss/cost function to be optimized. mse means mean squared error, as we have defined it in Equation 1.4. Other common loss functions include
– mae, or mean absolute error—similar to mse, but it uses the absolute value of h(xi) − yi in Equation 1.4 instead of the squared value.
– mape, or mean absolute percentage error—similar to mae, but it uses the percentage of the absolute error compared to the previous iteration as a target metric.
– mean_squared_logarithmic_error—similar to mse, but it uses the logarithm of the mean squared error (useful if your curve has exponential features).
– Several cross-entropy loss functions (e.g., categorical_crossentropy, sparse_categorical_crossentropy, and binary_crossentropy), often used for classification problems.
• The metrics attribute is a list that identifies the metrics used to evaluate the performance of your model. In this example, we use the same metric as the loss/cost function (the mean squared error), but other metrics can be used as well. A metric is conceptually similar to the loss/cost function, except that the result of the metric function is only used to evaluate the model, not to train it. You can also use multiple metrics if you want your model to be evaluated according to multiple features. Other common metrics include
– mae, or mean absolute error.
– accuracy and its derived metrics (binary_accuracy, categorical_accuracy, sparse_categorical_accuracy, top_k_categorical_accuracy, etc.). Accuracy is often used in classification problems, and it expresses the fraction of correctly labelled items in the training set over the total amount of items in the training set.
– Custom metrics can also be defined. As we'll see later when we tackle classification problems, precision, recall, and F1 score are quite popular evaluation metrics. These aren't part of the core framework (yet?), but they can easily be defined.
• Finally, we train our model on the normalized data from the previous step using the fit function. The first argument of the function is the vector of input values, the second argument is the vector of expected output values, and then we specify the number of epochs, that is, the training iterations that we want to perform on this data.
The epochs value depends a lot on your dataset. The cumulative number of input samples you will present to your model is given by m × epochs, where m is the size of your dataset. In our case, we've got a relatively small dataset, which means that you want to run more training epochs to make sure that your model has seen enough data. If you have larger training sets, however, you may want to run fewer training iterations. The risk of running too many training iterations on the same batch of data, as we will
see later, is to overfit your data, that is, to create a model that closely mimics the expected outputs if presented with values close enough to those it’s been trained on, but inaccurate over data points it hasn’t been trained on.
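For reference, the loss functions listed above can be sketched in plain numpy. These are just the underlying formulas on toy values, not the Keras implementations:

```python
import numpy as np

# Toy ground-truth and predicted values, purely illustrative
y_true = np.array([100.0, 150.0, 200.0])
y_pred = np.array([110.0, 140.0, 180.0])

# Mean squared error: average of the squared differences
mse = np.mean((y_pred - y_true) ** 2)

# Mean absolute error: average of the absolute differences
mae = np.mean(np.abs(y_pred - y_true))

# Mean absolute percentage error: absolute error relative to the true value
mape = 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

# Mean squared logarithmic error: squared error on log(1 + value)
msle = np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2)
```

Note how mse penalizes the 20-unit error on the last sample four times as much as the 10-unit errors, while mae penalizes it only twice as much.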
1.4.6 Evaluating your model

Now that we have defined and trained our model and we have a clear idea of how to measure its performance, let's take a look at how its primary metric (mean squared error) has improved during the training phase:

epochs = history.epoch
loss = history.history['loss']

fig = plt.figure()
plot = fig.add_subplot()
plot.set_xlabel('epoch')
plot.set_ylabel('loss')
plot.plot(epochs, loss)

You should see a plot like the one shown in Figure 1-4. You'll notice that the loss curve goes drastically down—which is good; it means that our model is actually learning when we feed it data points, without getting stuck in gradient "valleys." It also means that the learning rate was well calibrated—if your learning rate is too high, the model may not necessarily converge, and if it's too low, then it may still be on its way toward convergence after many training epochs. Also, there is a short tail around 0.07. This is also good: it means that our model has converged over the last iterations.
Figure 1-4. Linear regression model loss evolution over training

If the tail is too long, it means that you have trained your model for too many epochs, and you may want to reduce either the number of epochs or the size of your training set to prevent overfit. If the tail is too short or there's no tail at all, your model hasn't been trained on enough data points, and you will probably have to increase either the size of your training set or the number of epochs to gain accuracy. You can (read: should) also evaluate your model on some data that isn't necessarily part of your training set to check how well the model performs on new data points. We will dig deeper into how to split your training set and test set later. For now, given the relatively small dataset, we can evaluate the model on the dataset itself through the evaluate function:
model.evaluate(norm_size, norm_price)

You should see some output like this:

0s 2ms/step - loss: 0.0733 - mse: 0.0733
[0.0733143612742424, 0.0733143612742424]

The returned vector contains the values of the loss function and metric functions, respectively—in our case, since we used mse for both, they are the same.
Finally, let's see how the linear model actually looks against our dataset and let's start to use it to make predictions. First, define a predict function that will

1. Take a list of house sizes as input and normalize them against the mean and standard deviation of your training set

2. Query the linear model to get the predicted prices

3. Denormalize the output using the mean and standard deviation of the training set to convert the prices in thousands of dollars

def predict_prices(*x):
    x = normalize(x, dataset_stats['size'])
    return denormalize(model.predict(x), dataset_stats['price'])

Figure 1-5. Plotting your linear model against the dataset
And let's use this function to get some predictions of house prices:

predict_prices(90)
array([[202.53752]], dtype=float32)

The predict function will return a list of outputs associated to your inputs, where each item is a vector containing the predicted values (in this case, vectors of unitary size, because our model has a single output unit). If you look back at Figure 1-2, you'll see that a predicted price of 202.53752 for size = 90 doesn't actually look that far from the distribution of the data—and that's good. To visualize how our linear model looks against our dataset, let's plot the dataset again and calculate two points on the model in order to draw the line:

# Draw the linear regression model as a line between the first and
# the last element of the numeric series. x will contain the lowest
# and highest size (assumption: the series is ordered) and y will
# contain the price predictions for those inputs.
x = [size[0], size.iat[-1]]
y = [p[0] for p in predict_prices(*x)]

# Create a new figure and plot both the input data points and the
# linear model approximation.
fig = plt.figure()
data_points = fig.add_subplot()
data_points.scatter(size, price)
data_points.set_xlabel(columns[0])
data_points.set_ylabel(columns[1])

model_line = fig.add_subplot()
model_line.plot(x, y, 'r')

The output of the preceding code will hopefully look like Figure 1-5. That tells us that the line calculated by the model isn't actually that far from our data. If you happen to see that your line is far from the data, it probably means that you haven't trained your model on enough data, or the learning rate is too high or too low, or that there's no strong linear correlation between the metrics in your dataset (and maybe you need a higher-degree polynomial model), or maybe that there are many "outliers" in your dataset—that is, many data points outside of the main distribution that "drag" the line out.
1.4.7 Saving and loading your model

Your model is now loaded in memory in your notebook, but you will lose it once the Jupyter notebook stops. You may want to save it to the filesystem, so you can recover it later without going through the training phase again, or include it in another script to make predictions. Luckily, it's quite easy to save and load Keras models on the filesystem—the following examples, however, will assume that you are using a version of TensorFlow ≥ 2.0:
def model_save(model_dir, overwrite=True):
    import json
    import os

    os.makedirs(model_dir, exist_ok=True)

    # The TensorFlow model save won't keep track of the labels of
    # your model. It's usually a good practice to store them in a
    # separate JSON file.
    labels_file = os.path.join(model_dir, 'labels.json')
    with open(labels_file, 'w') as f:
        f.write(json.dumps(list(columns)))

    # Also, you may want to keep track of the x and y mean and
    # standard deviation to correctly normalize/denormalize your
    # data before/after feeding it to the model.
    stats = [
        dict(dataset_stats['size']),
        dict(dataset_stats['price']),
    ]

    stats_file = os.path.join(model_dir, 'stats.json')
    with open(stats_file, 'w') as f:
        f.write(json.dumps(stats))

    # Then, save the TensorFlow model using the save primitive
    model.save(model_dir, overwrite=overwrite)
You can then load the model in another notebook, script, application, and so on (bindings of TensorFlow are available for most of the programming languages in use nowadays) and use it for your predictions:

def model_load(model_dir):
    import json
    import os
    from tensorflow.keras.models import load_model

    labels = []
    labels_file = os.path.join(model_dir, 'labels.json')
    if os.path.isfile(labels_file):
        with open(labels_file) as f:
            labels = json.load(f)

    stats = []
    stats_file = os.path.join(model_dir, 'stats.json')
    if os.path.isfile(stats_file):
        with open(stats_file) as f:
            stats = json.load(f)

    m = load_model(model_dir)
    return m, stats, labels

model, stats, labels = model_load(model_dir)
price = predict_prices(90)
1.5 Multivariate linear regression

So far we have explored linear regression models with one single input and output variable. Real-world regression problems are usually more complex, and the output features are usually expressed as a function of multiple variables. The price of a house, for instance, won't depend only on its size but also on its construction year, number of bedrooms, presence of extras such as a garden or terrace, distance from the city center, and so on. In such a generic case, we express each input data point as a vector x = (x_1, x_2, ..., x_n) ∈ R^n, and the regression expression seen in Equation 1.2 is reformulated as

    h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n = \theta_0 + \sum_{i=1}^{n} \theta_i x_i    (1.17)

By convention, the input vector in the case of multivariate regression is rewritten as x = (x_0, x_1, x_2, ..., x_n) ∈ R^{n+1}, with x_0 = 1, so the preceding expression can be written more compactly as

    h_\theta(x) = \sum_{i=0}^{n} \theta_i x_i    (1.18)

Or, by using the vectorial notation, the hypothesis function can be written as the scalar product between the vector of parameters \theta and the vector of features x:

    h_\theta(x) = [\theta_0, \dots, \theta_n] \begin{bmatrix} x_0 \\ \vdots \\ x_n \end{bmatrix} = \theta^T x    (1.19)
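Equation 1.19 translates directly into numpy's dot product. The parameter and feature values below are purely illustrative:

```python
import numpy as np

# Illustrative parameter vector and feature vector; x[0] = 1 by convention
theta = np.array([2.0, 0.5, -1.0])
x = np.array([1.0, 4.0, 3.0])

# Equation 1.19: the hypothesis is the scalar product theta^T x
h = theta @ x
```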
Keep in mind that, by convention, vectors are represented as columns of values. The T notation denotes the transposed vector, that is, the vector represented as a row. By multiplying a row vector by a column vector, you get the scalar product of the two vectors, so \theta^T x is a compact way to represent the scalar product of \theta and x. So the mean squared error cost function in Equation 1.8 can be rewritten in the multivariate case as

    J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( \sum_{j=0}^{n} \theta_j x_j^{(i)} - y^{(i)} \right)^2    (1.20)

Or, using the vectorial notation:

    J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( \theta^T x^{(i)} - y^{(i)} \right)^2    (1.21)
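As a sketch, Equation 1.21 maps almost literally onto numpy's vector operations. The sample data below is illustrative:

```python
import numpy as np

def cost(theta, X, y):
    # Equation 1.21: J(theta) = 1/(2m) * sum_i (theta^T x^(i) - y^(i))^2
    # X has one row per sample, with x_0 = 1 prepended by convention
    m = len(y)
    residuals = X @ theta - y
    return (residuals @ residuals) / (2 * m)

# These points lie exactly on y = 1 + x, so theta = [1, 1] has zero cost
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 3.0, 4.0])
```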
Since the inputs are no longer unitary, we are no longer talking of lines on a plane, but of hyper-surfaces in an n-dimensional space—a 1D line in a 2D space defined by one variable, a 2D plane in a 3D space defined by two variables, a 3D hyper-surface in a 4D space defined by three variables, and so on. The cost function, on its side, will have \theta = (\theta_0, \theta_1, ..., \theta_n) ∈ R^{n+1} parameters—while it was a paraboloid surface in a 3D space in the univariate case, it will be an (n + 1)-dimensional surface in the case of n input features. This makes the multivariate case harder to visualize than the univariate one, but we can still rely on our performance metrics to evaluate how well the model is doing, or break down the linear n-dimensional surface by feature to analyze how each variable performs against the output feature.
By applying the generic vectorial equation for gradient descent shown in Equation 1.10, we can also rewrite the parameters' update formulas in Equations 1.11 and 1.12 in the following way (remember that x_0 = 1 by convention):

    \theta_j^{(k+1)} = \theta_j^{(k)} - \frac{\alpha}{m} \sum_{i=1}^{m} \left( \theta^T x^{(i)} - y^{(i)} \right) x_j^{(i)}    (1.22)
We should now have all the tools to write a multivariate regression algorithm with numpy alone:

import numpy as np

def gradient_descent(x, y, theta, alpha):
    m = len(x)
    n = len(theta)
    new_theta = np.zeros(n)

    # Perform the gradient descent on theta
    for i in range(m):
        # s = theta[0] + (theta[1]*x[1] + .. + theta[n]*x[n]) - y[i]
        s = theta[0]
        for j in range(1, n):
            s += theta[j] * x[i][j-1]
        s -= y[i]

        new_theta[0] += s
        for j in range(1, n):
            new_theta[j] += s * x[i][j-1]

    return theta - (alpha/m) * new_theta

def train(x, y, steps, alpha=0.001):
    # Initialize theta randomly
    theta = np.random.randint(low=0, high=10, size=len(x[0])+1)

    # Perform the gradient descent <steps> times
    for i in range(steps):
        theta = gradient_descent(x, y, theta, alpha)

    # Return the linear function associated to theta
    def model(x):
        y = theta[0]
        for i in range(len(theta)-1):
            y += theta[i+1] * x[i]
        return y

    return model

These changes will make the linear regression algorithm analyzed for the univariate case work also in the generic multivariate case. In the preceding code, I have expanded all the vectorial operations into for statements for the sake of clarity, but keep in mind that most of the real regression algorithms out there perform vector sums and scalar products using native vectorial operations where available. You should have noticed by now that gradient descent is a quite computationally expensive procedure. The algorithm goes over an m-sized dataset for epochs times
and performs a few vector sums and scalar products each time, and each dataset item consists of n features. Things get even more computationally expensive once you perform gradient descent on multiple nodes, like in a neural network. Expressing the preceding steps in vectorial form allows us to take advantage of some optimizations and parallelizations available either in the hardware or the software. I'll leave it as a take-home exercise to write the preceding n-dimensional gradient descent algorithm using vectorial primitives. Before jumping into a practical example, let me spend a few words on two quite important topics in multi-feature problems: feature selection and training/test set split.
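For reference, here is one possible vectorized sketch of that exercise, under the assumption that a bias column x_0 = 1 is prepended to the inputs (the function name is illustrative):

```python
import numpy as np

def train_vectorized(x, y, steps, alpha=0.1):
    # Prepend the conventional x_0 = 1 column so theta[0] acts as the bias
    X = np.hstack([np.ones((len(x), 1)), np.asarray(x, dtype=float)])
    y = np.asarray(y, dtype=float)
    m, n = X.shape
    theta = np.zeros(n)

    for _ in range(steps):
        # Equation 1.22 for all j at once:
        # theta <- theta - (alpha/m) * X^T (X theta - y)
        theta = theta - (alpha / m) * (X.T @ (X @ theta - y))

    # Return the linear function associated to theta
    return lambda x_new: np.hstack([1.0, np.asarray(x_new, dtype=float)]) @ theta

# Fit y = 1 + 2x on a tiny illustrative dataset
model = train_vectorized([[0.0], [1.0], [2.0], [3.0]],
                         [1.0, 3.0, 5.0, 7.0], steps=5000)
```

The two nested for loops collapse into a single matrix-vector expression, which numpy can dispatch to optimized native code.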
1.5.1 Redundant features

Adding more input features to your model will usually make it more accurate in real-world examples. In the first regression problem we solved, our model could predict the price of a house solely based on its size. We know by intuition that if we also input features such as the year of construction of the house, the average price of the houses on that road, the presence of balconies or gardens, or the distance from the city center, we may get more accurate predictions. However, there is a limit to that. It is quite important that the features you feed to your model are linearly independent from each other. Given a list of vectors [x_1, x_2, ..., x_m], a vector x_i in this list is defined as linearly dependent if it can be written as

    x_i = \sum_{j=1, j \neq i}^{m} k_j x_j    (1.23)
with k = [k_1, k_2, ..., k_m] ∈ R^m. In other words, if a feature can be expressed as a linear combination of any number of other features, then that feature is redundant. Examples of redundant features include
• The same price expressed both in euros and in dollars

• The same distance expressed both in meters and in feet, or in meters and kilometers

• A dataset that contains the net weight, the tare weight, and the gross weight of a product

• A dataset that contains the base price, the VAT rate, and the final price of a product
The preceding scenarios are all examples of features that are linear combinations of other features. You should remove all the redundant features from your dataset before feeding it to a model; otherwise, the predictions of your model will be skewed, or biased, toward the features that make up the derived feature(s). There are two ways of doing this:

1. Manually: Look at your data, and try to understand if there are any attributes that are redundant—that is, features that are linear combinations of other features.

2. Analytically: You will have m inputs in your dataset, each represented by n features. You can arrange them into an m × n matrix X ∈ R^{m×n}. The rank of this matrix, ρ(X), is defined as its number of linearly independent vectors (rows or columns). We can say that our dataset has no redundant features if its associated matrix X is such that

    \rho(X) = \min(m, n)    (1.24)
If ρ(X) = min(m, n) − 1, then the dataset has one linearly dependent vector. If ρ(X) = min(m, n) − 2, then it has two linearly dependent vectors, and so on. Both numpy and scipy have built-in methods to calculate the rank of a matrix, so that may be a good way to go if you want to double-check that there are no redundant features in your dataset. It's also equally important to try and remove duplicate rows in your dataset, as they could "pull" your model in some specific ranges. However, the impact of duplicate rows in a dataset won't be as bad as duplicate columns if m ≫ n—that is, if the number of data samples is much greater than the number of features.
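A quick sketch of the analytical check with numpy's matrix_rank (the toy matrix below is illustrative):

```python
import numpy as np

# Toy dataset: 4 samples x 3 features, where the third feature is a
# linear combination of the first two (x3 = 2*x1 + x2)
X = np.array([
    [1.0, 2.0, 4.0],
    [2.0, 1.0, 5.0],
    [3.0, 3.0, 9.0],
    [4.0, 0.0, 8.0],
])

# rank < min(m, n) reveals a redundant feature (Equation 1.24)
rank = np.linalg.matrix_rank(X)
```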
1.5.2 Principal component analysis

An analytical (and easy to automate) way to remove redundant features from your training set is to perform what's known as principal component analysis, or PCA. PCA is especially useful when you have a really high number of input features and performing a manual analysis of the functional dependencies would be cumbersome. PCA is an algorithm for feature extraction: it reduces the number of dimensions in an n-dimensional input space A by mapping the points in A to a new k-dimensional space A′, with k ≤ n, such that the features in A′ are linearly independent from each other—or, in other words, orthogonal to each other. The math behind PCA may seem a bit hard at first look, but it relies on a quite intuitive geometric idea, so bear with me for the next couple of formulas. The first step in PCA is feature normalization, as we have seen it in Equation 1.13:

    \hat{x}_i = \frac{x_i - \mu_x}{\sigma_x} \quad \text{for } i = 1, \dots, n    (1.25)
with μx denoting the arithmetic mean of x and σx its standard deviation.
Then, given a normalized training set with input features x = [x_1, x_2, ..., x_n], we calculate the covariance matrix of x as

    \mathrm{cov}(x, x) = (x - \mu_x)(x - \mu_x)^T =
    \begin{bmatrix}
    (x_1 - \mu_x)(x_1 - \mu_x) & \cdots & (x_1 - \mu_x)(x_n - \mu_x) \\
    (x_2 - \mu_x)(x_1 - \mu_x) & \cdots & (x_2 - \mu_x)(x_n - \mu_x) \\
    \vdots & \ddots & \vdots \\
    (x_n - \mu_x)(x_1 - \mu_x) & \cdots & (x_n - \mu_x)(x_n - \mu_x)
    \end{bmatrix}    (1.26)
You may recall from linear algebra that the product of two vectors in the form row into column, x^T y, is what's usually called the scalar product or dot product, and it returns a single real number, while the product of two vectors in the form column into row, x y^T, returns an n × n matrix, and that's what we get with the preceding covariance matrix. The intuition behind the covariance matrix of a vector with itself is to have a geometrical representation of the input distribution—it's like a multi-dimensional extension of the concept of variance in the one-dimensional case. We then proceed with calculating the eigenvectors of cov(x, x). The eigenvector of a matrix is defined as a non-zero vector that remains unchanged (or at most is scaled by a scalar λ, called the eigenvalue) when we apply the geometric transformation described by the matrix to it. For example, consider the spinning movement of our planet around its axis. The rotation movement can be mapped into a matrix that, given any point on the globe, can tell where that point will be located after applying a rotation to it. The only points whose locations don't change when the rotation is applied are those along the rotation axis. We can then say that the rotation axis is the eigenvector of the matrix associated to the rotation of our planet and that, at least in this specific case, the eigenvalue for that vector is λ = 1—the points along the axis don't change at all during the rotation; they are not even scaled. The case of the rotation of a sphere has a
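The row-into-column vs. column-into-row distinction can be illustrated with numpy (toy values):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])

# Row into column (x^T y): the scalar/dot product, a single real number
dot = x @ y

# Column into row (x y^T): an n x n matrix, the form used in Equation 1.26
outer = np.outer(x, y)
```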
single eigenvector (the rotation axis), but other geometric transformations might have multiple eigenvectors, each with a different associated eigenvalue. To formalize this intuition, we say that, given a matrix A ∈ R^{n×n} that describes a certain geometric transformation in an n-dimensional space, its eigenvector v must be a vector that is at most scaled by a factor λ when we apply A to it:

    A v = \lambda v    (1.27)

By grouping v in the preceding equation, we get

    (A - \lambda I_n) v = 0    (1.28)

Figure 1-6. Principal component analysis of a normalized dataset. The vectors in green indicate the components that are more influential in the distribution of the data. You may want to map your input data to this new system of coordinates. You can also select only the largest vector without actually losing a lot of information
where I_n is the identity matrix (a matrix with 1s on the main diagonal and 0s everywhere else) of size n × n. The preceding vectorial notation can be expanded into a set of n equations, and by solving them, we can get the eigenvalues λ of A. By replacing the eigenvalues in the preceding equation, we can then get the eigenvectors associated to the matrix.

There is a geometric intuition behind computing the eigenvectors of the auto-covariance matrix. Those eigenvectors indicate in which direction in the n-dimensional input space you have the highest "spread" of the data. Those directions are the principal components of your input space, that is, the components that are most relevant to model the distribution of your data, and therefore you can map the original space into a new space with fewer dimensions without actually losing much information. If an input feature can be expressed as the linear combination of some other input features, then the covariance matrix of the input space will have two or more eigenvectors with the same eigenvalue, and therefore the number of dimensions can be collapsed. You can also decide to prune some features that only marginally impact the distribution of your data, even if they are not strictly linearly dependent, by choosing the k eigenvectors with the k highest associated eigenvalues—intuitively, those are the components that matter the most in the distribution of your data. An example is shown in Figure 1-6. Once we have the principal components of our input space, we need to transform our input space by reorienting its axes along the eigenvectors—note that those eigenvectors are orthogonal to each other, just like the axes of a Cartesian plane. Let's call W the matrix constructed from the selected principal components (eigenvectors of the auto-covariance matrix). Given a normalized dataset represented by a matrix X, its points will be mapped into the new space through

    \hat{X} = X W    (1.29)
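The whole pipeline of Equations 1.25–1.29 can be sketched by hand with numpy. The toy dataset and the choice of k = 1 are illustrative, and note that np.cov uses a 1/(m−1) scaling on top of the outer-product form of Equation 1.26:

```python
import numpy as np

# Toy dataset: 5 samples x 2 correlated features
X = np.array([[-1.2, -1.0], [-0.8, -1.1], [0.1, 0.2],
              [0.9, 0.8], [1.0, 1.1]])

# Equation 1.25: feature normalization
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Equation 1.26: covariance matrix of the normalized features
C = np.cov(X, rowvar=False)

# Equations 1.27-1.28: eigendecomposition (eigh handles symmetric matrices)
eigvals, eigvecs = np.linalg.eigh(C)

# Keep the k eigenvectors with the largest eigenvalues (here k = 1)
order = np.argsort(eigvals)[::-1]
W = eigvecs[:, order[:1]]

# Equation 1.29: project the dataset onto the principal component(s)
X_hat = X @ W
```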
We will then train our algorithms on the new matrix X̂, whose number of dimensions will be equal to or lower than the initial number of features, without significant information loss. Many Python libraries for machine learning and data science already feature some functions for principal component analysis. An example that uses scikit-learn:

import numpy as np
from sklearn.decomposition import PCA

# Input vector
x = np.array([[-1, -1], [-2, -1], [-3, -2],
              [1, 1], [2, 1], [3, 2]])

# Define a PCA model that brings the components down to 2
pca = PCA(n_components=2)

# Fit the input data through the model
pca.fit(x)

PCA has some obvious advantages—among those, it reduces the dimensionality of an input space with a high number of features analytically, reduces the risk of overfit, and improves the performance of your algorithms. However, it maps the original input space into a new space built out of the principal components, and those new synthetic features may not be as intuitive to grasp as real-world features such as "size," "distance," or "time." Also, if you pick a number of components that is lower than the actual number of components that influence your data, you may lose information, so it's usually good practice to compare the performance of a model before and after applying PCA to check whether you have removed some features that you actually need.
1.5.3 Training set and test set

The first linear regression example we saw was trained on a quite small dataset; therefore, we decided to both train and test the model on the same dataset. In real-world scenarios, however, you will usually train your model on very large datasets, and it's important to evaluate your model on data points that it hasn't been trained on. To do so, the input dataset is conventionally split into two:

1. The training set contains the data points your model will be trained on.

2. The test set contains the data points your model will be evaluated on.
Figure 1-7. Example of a 70/30 training set/test set split
Figure 1-8. A look at the Auto MPG dataset

This is usually done by splitting your dataset according to a predefined fraction—the items on the left of the pivot will make up the training set, and the ones on the right will make up the test set, as shown in Figure 1-7. A few observations on the dataset split:
1. The split fraction you want to choose depends largely on your dataset. If you have a very large dataset (let's say millions of data points or more), then you can select a large training set split (e.g., 90% training set and 10% test set), because even if the fraction of the test set is small, it will still include tens or hundreds of thousands of items, which is significant enough to evaluate your model. In scenarios with smaller datasets, you may want to experiment with different fractions to find the best trade-off between exploiting the available data for training purposes and keeping a test set that is statistically significant enough to evaluate your model.

2. If your dataset is sorted according to some feature, then make sure to shuffle it before performing the split. It is quite important that the data your model is both trained and evaluated on is as uniform as possible.
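The two observations above boil down to a shuffle-then-split procedure, which can be sketched with numpy (the function name and fraction are illustrative):

```python
import numpy as np

def split_dataset(data, train_frac=0.8, seed=1):
    # Shuffle first, so a dataset sorted by some feature doesn't produce
    # systematically different training and test distributions
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(data))

    # Items left of the pivot form the training set, the rest the test set
    pivot = int(len(data) * train_frac)
    return data[idx[:pivot]], data[idx[pivot:]]

data = np.arange(100)
train_set, test_set = split_dataset(data)
```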
1.5.4 Loading and visualizing the dataset

In this example, we will load the Auto MPG dataset [15], a dataset that includes several parameters about 1970s–1980s cars (cylinders, weight, acceleration, year, horsepower, fuel efficiency, etc.). We want to build a model that predicts the fuel efficiency of a car from those years given the respective input features.
First, let's download the dataset, load it in our notebook, and take a look at it:

import pandas as pd
import matplotlib.pyplot as plt

dataset_url = ('http://archive.ics.uci.edu/ml/'
               'machine-learning-databases/'
               'auto-mpg/auto-mpg.data')

# These are the dataset columns we are interested in
columns = ['MPG', 'Cylinders', 'Displacement', 'Horsepower',
           'Weight', 'Acceleration', 'Model Year']

# Load the CSV file
dataset = pd.read_csv(dataset_url, names=columns,
                      na_values="?", comment='\t',
                      sep=" ", skipinitialspace=True)

# The dataset contains some empty cells - remove them
dataset = dataset.dropna()

# Take a look at the last few records of the dataset
dataset.tail()
You will probably see a table like the one shown in Figure 1-8. Let's also take a look at how the features correlate with each other. We will use the seaborn library for it, and in particular the pairplot method, to visualize how each metric's data points plot against each of the other metrics:

import seaborn as sns

sns.pairplot(dataset[["MPG", "Cylinders", "Displacement", "Weight"]],
             diag_kind="kde")

You should see a plot like the one shown in Figure 1-9. Each graph plots the data points broken down by a pair of metrics: that's a useful way to spot correlations. You'll notice that the number of cylinders, displacement, and weight are quite closely related to MPG, while other metrics are more loosely related to each other. We will now split the dataset into two, as shown in Section 1.5.3. The training set will contain 80% of the data and the test set the remaining 20%:

# random_state initializes the seed of the random number generator
# used for the sampling. If None, it will be calculated automatically
train_dataset = dataset.sample(frac=0.8, random_state=1)

# The test dataset contains all the records left after the split
test_dataset = dataset.drop(train_dataset.index)

# Fuel efficiency (MPG) will be our output label, so drop
# it from the training and test datasets
train_labels = train_dataset.pop('MPG')
test_labels = test_dataset.pop('MPG')
And normalize the data:
def normalize(x, stats):
    return (x - stats['mean']) / stats['std']

def denormalize(x, stats):
    return x * stats['std'] + stats['mean']

Figure 1-9. A look at how each feature relates to each other feature through seaborn
data_stats = train_dataset.describe().transpose()
label_stats = train_labels.describe().transpose()

norm_train_data = normalize(train_dataset, data_stats)
norm_train_labels = normalize(train_labels, label_stats)

Then, just like in the previous example, we will define a linear model and try to fit it through our data:

from tensorflow.keras.experimental import LinearModel

model = LinearModel(len(train_dataset.keys()),
                    activation='linear', dtype='float32')

model.compile(optimizer='rmsprop', loss='mse', metrics=['mae', 'mse'])
history = model.fit(norm_train_data, norm_train_labels,
                    epochs=200, verbose=0)

The differences this time are

• The model in the previous example had a single input unit; this one has as many input units as the columns of our training set (excluding the output features).

• We use two evaluation metrics this time, both mae and mse. In most cases, it's a good practice to keep a primary evaluation metric other than the loss function.
Let's plot again the loss function over the training iterations:

epochs = history.epoch
loss = history.history['loss']

fig = plt.figure()
plot = fig.add_subplot()
plot.set_xlabel('epoch')
plot.set_ylabel('loss')
plot.plot(epochs, loss)

You should see a figure like the one shown in Figure 1-10. Also, we can now evaluate the model on the test set and see how it performs on data it has not been trained on:
Figure 1-10. Loss function progress over the multivariate regression training
norm_test_data = normalize(test_dataset, data_stats)
norm_test_labels = normalize(test_labels, label_stats)

model.evaluate(norm_test_data, norm_test_labels)

Keep in mind that so far we have been using a single regression unit: things can be even better when we pack many of them into a neural network. We can pick some values from the test set and see how far the model's predictions are from the expected labels:

sampled_data = norm_test_data.iloc[:10]
sampled_labels = denormalize(norm_test_labels.iloc[:10], label_stats)
predictions = [p[0] for p in
               denormalize(model.predict(sampled_data), label_stats)]

for i in range(10):
    print(f'predicted: {predictions[i]} ' +
          f'actual: {sampled_labels.iloc[i]}')

Plotting these values on a bar chart should result in a figure like the one shown in Figure 1-11. You can then perform minimal changes to the model_save and model_load functions we have encountered in Section 1.4.7 to save and load your model from another notebook or application.
1.6 Polynomial regression

We now know how to create linear regression models on one or multiple input variables. We have seen that these models can be geometrically represented by n-dimensional linear hyper-surfaces in the (n + 1)-dimensional space that consists of the n-dimensional input features plus the output feature y—the model will be a line that goes through points in a 2D space in the case of univariate linear regression, a 2D surface that goes through points in a 3D space if you have two input features and one output variable, and so on.
Figure 1-11. Comparison between the predicted and expected values

However, not all the problems in the real world can be approximated by linear models. Consider the dataset under /datasets/house-size-price-2.csv. It's a variant of the house size vs. price dataset we have encountered earlier, but this time, we hit a plateau in the 100–130 square meter range (e.g., houses of that size or with that particular room configuration don't sell much in the location), and then the price keeps increasing again after 130 square meters. Its representation is shown in Figure 1-12. You can't really capture a grow-and-stop-and-grow
sequence like this with a straight line alone. Worse, if you try to fit a straight line through this dataset using the linear procedure we have analyzed earlier, the line could end up being "pulled" down by the points around the plateau in order to minimize the overall mean squared error, losing accuracy also on the remaining points in the dataset. In cases like this, you may instead want to leverage a polynomial model. Keep in mind that linear regression can be modelled, in its simplest form, as hθ(x) = θ0 + θ1x, but nothing prevents us from defining a hypothesis function hθ(x) that includes higher powers of x. For instance, this non-linear house size vs. price model might not be well represented by a quadratic function: remember that a parabola first goes up and then down, and you don't really expect house prices to go significantly down as the size increases. A cubic model, however, could work just fine: remember that a cubic function on a plane looks like two "half-parabolas" joined around an inflection point, which is also the point of symmetry of the function. So the hypothesis function for this case could be something like this:

h_\theta(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3    (1.30)
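As a minimal sketch, the cubic hypothesis in Equation 1.30 can be written as a plain Python function. The θ values used in the example below are arbitrary placeholders for illustration, not fitted parameters:

```python
def cubic_h(theta):
    """Return the cubic hypothesis h(x) = θ0 + θ1·x + θ2·x² + θ3·x³."""
    def _model(x):
        return (theta[0] + theta[1] * x
                + theta[2] * x ** 2 + theta[3] * x ** 3)
    return _model

# Arbitrary example parameters
h = cubic_h([1.0, 0.5, -0.2, 0.05])
```

The actual values of θ would of course come out of the fitting procedure described below.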
Figure 1-12. Example of house size vs. price dataset that can’t be accurately represented by a linear model
A clever way to find the values of θ that minimize the preceding function is to treat the additional powers of x as additional variables, perform a variable substitution, and then treat it as a generic multivariate linear regression problem. In other words, we want to translate the problem of gradient descent on a polynomial function into a problem of multivariate linear regression. For instance, we can take the preceding hθ(x) expression and rewrite it as a function of x̄, where

\bar{x} = \left[ 1, x, x^2, x^3 \right]    (1.31)
So hθ(x) can be rewritten as

h_\theta(\bar{x}) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3    (1.32)
The hypothesis written in this form is the same as the one we have seen in Equation 1.17. We can therefore proceed and calculate the values of θ through the linear multivariate gradient descent procedure we have analyzed, as long as you keep a few things in mind:

1. The values of θ that you get out of the algorithm must be plugged into the cubic hypothesis function in Equation 1.30 when you make predictions, not into the linear function we have seen in that section.

2. Feature scaling/input normalization is always important in regression models, but it's even more important when it comes to polynomial regression problems. If the size of a house in square meters is in the range [0, …, 10³], then its squared value will be in the range [0, …, 10⁶] and its cubic value will be in the range [0, …, 10⁹]. If you don't normalize the inputs before feeding them to the model, you will
end up with a model where the highest polynomial terms weigh much more than the rest, and such a model may simply not converge.

3. In the preceding example, we have selected a cubic function because it looks like a good fit for the data at a glance: some growth, an inflection point, and then growth again. However, this won't be true for all the models out there. Some models could perform better with higher polynomial powers (e.g., 4th or 5th powers of x), or maybe fractional powers of x (square or cubic roots, for example). Or, in the case of multiple input features, some relations could be well expressed by the product or ratio of some features. The important takeaway here is to always look at your dataset before reasoning about the best analytical function to fit it. Sometimes you may also want to plot a few sample hypothesis functions to see which one has a shape that best fits your data.

Overall, translating a polynomial regression problem into a multivariate regression problem is a good idea because, as we have seen previously, the cost function of a linear model is usually expressed by a simple n-dimensional quadratic model, which is guaranteed to have only one point with null gradient, and that point is also the global minimum. In such a configuration, a well-designed gradient descent algorithm should be able to converge toward the optimal solution by simply following the direction of the gradient vector, without getting "stuck" in the bumps and valleys that you may get when you differentiate higher-degree polynomial functions.
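The variable substitution of Equation 1.31 and the normalization step described above can be sketched in a few lines of NumPy. The dataset below is made up purely for illustration; polynomial_features performs the substitution, and normalize_columns rescales the resulting columns so that no power of x dominates the others:

```python
import numpy as np

def polynomial_features(x):
    """Variable substitution x -> [1, x, x^2, x^3] (one row per sample)."""
    x = np.asarray(x, dtype=float)
    return np.column_stack([np.ones_like(x), x, x ** 2, x ** 3])

def normalize_columns(X):
    """Scale every non-bias column to zero mean and unit variance."""
    X = X.copy()
    X[:, 1:] = (X[:, 1:] - X[:, 1:].mean(axis=0)) / X[:, 1:].std(axis=0)
    return X

# Made-up house sizes in square meters
sizes = np.array([50.0, 80.0, 100.0, 120.0, 150.0])
X = normalize_columns(polynomial_features(sizes))
```

The normalized matrix X can then be fed to the multivariate linear regression procedure seen earlier, and the resulting θ plugged back into the cubic hypothesis of Equation 1.30.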
1.7 Normal equation

Gradient descent is definitely among the most popular approaches for solving regression problems. However, it's not the only way. Gradient descent, as we have seen, finds the parameters θ that optimize the cost function by iteratively "walking" along the direction of the gradient until we hit the minimum. The normal equation provides instead an algebraic way to calculate θ in one shot. Such an approach, as we will see soon, has some advantages as well as some drawbacks compared to gradient descent. In this section, we will briefly cover what the normal equation is and how it is derived, without going too much in depth into the formal proof. I will assume that you have some knowledge of linear algebra and vector/matrix operations (inverse, transpose, and product). If that's not the case, however, feel free to skip this section, or just take note of the final equation. The normal equation provides an analytical alternative to gradient descent for minimization problems, but it's not strictly required to build models.

We have seen that the generic cost function of a regression model can be written as

J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( \theta^T x_i - y_i \right)^2    (1.33)
And the problem of finding the optimal model is a problem of finding the values of θ such that the gradient vector of J(θ) is zero:

\nabla J(\theta) = \begin{bmatrix} \frac{\partial}{\partial \theta_0} J(\theta) \\ \vdots \\ \frac{\partial}{\partial \theta_n} J(\theta) \end{bmatrix} = 0    (1.34)
Chapter 1
IntroduCtIon to MaChIne LearnIng
Or, in other words:

\frac{\partial}{\partial \theta_j} J(\theta) = 0 \quad \text{for } j = 0, \ldots, n    (1.35)
If you expand the scalar products and sums in Equation 1.33 for a dataset X ∈ ℝ^(m×n), where n is the number of features, m is the number of input samples, and y ∈ ℝ^m represents the vector of the output features, and solve the partial derivatives, you get n + 1 linear equations, where θ ∈ ℝ^(n+1) represents the variables. Like all systems of linear equations, this system too can be represented as the solution of a matrix × vector product, and we can solve the associated equation to calculate θ. It turns out that θ can be inferred by solving the following equation:

\theta = (X^T X)^{-1} X^T y    (1.36)
where X is the m × n matrix associated to the input features of your dataset (with an added x0 = 1 term at the beginning of each vector, as we have previously seen), Xᵀ denotes the transposed matrix (i.e., the matrix you get by swapping rows and columns), the −1 operator denotes the inverse matrix, and y is the vector of output features of your dataset.

The normal equation has a few advantages over the gradient descent algorithm:

1. You won't need to perform several iterations, nor risk getting stuck or diverging: the values of θ are calculated straight away by solving the system of n + 1 associated gradient linear equations.

2. As a consequence, you won't need to choose a learning rate parameter α, as there's no incremental learning procedure involved.
However, it has a few drawbacks:

1. Gradient descent will still perform well even if the number of input features n is very large. A higher number of features translates into a larger scalar product in your θ-update steps, and the complexity of a scalar product increases linearly as n grows. On the other hand, a larger value of n means a larger XᵀX matrix in Equation 1.36, and calculating the inverse of a very large matrix is a very computationally expensive procedure (it has O(n³) complexity). This means that the normal equation can be a good solution for regression problems with a low number of input features, while gradient descent will perform better on datasets with more columns.

2. The normal equation works only if the XᵀX matrix is invertible. A matrix is invertible only if it is a full-rank square matrix, that is, its rank equals its number of rows/columns and, therefore, it has a non-zero determinant. If XᵀX is not full rank, it means either that you have some linearly dependent rows or columns (therefore, you have to remove some redundant features or rows) or that you have more features than dataset items (therefore, you have to either remove some features or add some training examples). These normalization steps are also important in gradient descent, but while a non-invertible dataset matrix could result in a biased or non-optimal model if you apply gradient descent, it will fail on a division by zero if you apply the normal equation. However, most of the modern
frameworks for machine learning also work in the case of non-invertible matrices, as they use mathematical tools such as the Moore-Penrose inverse to calculate the pseudoinverse. Anyway, even if the math will still work, keep in mind that a non-invertible characteristic matrix is usually a red flag for linearly dependent features that may affect the performance of your model, so it's usually a good idea to prune them before calculating the normal equation.
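As a quick sketch, Equation 1.36 maps directly onto NumPy's linear algebra routines. The data points below are synthetic and lie exactly on the line y = 3 + 2x, so the normal equation recovers θ = [3, 2] exactly:

```python
import numpy as np

# Synthetic dataset: one input feature plus the x0 = 1 bias column
X = np.column_stack([np.ones(4), np.array([1.0, 2.0, 3.0, 4.0])])
y = 3.0 + 2.0 * X[:, 1]

# Normal equation (eq. 1.36): theta = (X^T X)^-1 X^T y
theta = np.linalg.inv(X.T @ X) @ X.T @ y

# np.linalg.pinv computes the Moore-Penrose pseudoinverse instead,
# which keeps working even when X^T X is singular
theta_pinv = np.linalg.pinv(X) @ y
```

The pinv variant is what most frameworks fall back to when the characteristic matrix is not invertible, as mentioned above.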
Figure 1-13. Plot of the logistic function
1.8 Logistic regression

Linear regression solves the problem of creating models for numeric predictions. Not all problems, however, require predictions in a numerically continuous domain. We have previously mentioned that classification problems make up the other large category of problems in machine learning, that is, problems where instead of a raw numeric value you want your model to make a prediction about the class, or type, of the
provided input (e.g., Is it spam? Is it an anomaly? Does it contain a cat?). Fortunately, we won't need to perform many changes to the regression procedure we have seen so far in order to adapt it to classification problems. We have already learned to define a hypothesis function hθ(x) that maps x ∈ ℝⁿ to real values. We just need to find a hypothesis function that outputs values such that 0 ≤ hθ(x) ≤ 1 and define a threshold function that maps the output value to a numeric class (e.g., 0 for false and 1 for true). In other words, given a linear model θᵀx, we need to find a function g such that

h_\theta(x) = g(\theta^T x)    (1.37)

0 \le h_\theta(x) \le 1    (1.38)
A common choice for g is the sigmoid function, or logistic function, which also gives this type of regression the name of logistic regression. The logistic function of a variable z has the following formulation:

g(z) = \frac{1}{1 + e^{-z}}    (1.39)
The shape of this type of function is shown in Figure 1-13. The values of the function get close to zero as z decreases and close to one as z increases, and the function has an inflection point around z = 0, where its value is 0.5. It is therefore a good candidate to map the outputs of linear regression into the [0…1] range, with a strong non-linearity around the origin to model the "jump." By plugging our linear model into Equation 1.39, we get the formulation of the logistic regression:

h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}    (1.40)
Let's stick for now to the case of logistic regression with a single output class (false/true). We'll see soon how to expand it to multiclass problems as well. If we stick to this definition, then the logistic curve above expresses the probability for an input to be "true." You can interpret the output value of the logistic function as a Bayesian probability: the probability that an item does/does not belong to the output class given its input features:

h_\theta(x) = P(\text{"true"} \mid x)    (1.41)
Given the shape of the sigmoid function, we can formalize the classification problem as follows:

\text{prediction} = \begin{cases} \text{true} & \text{if } g(\theta^T x) \ge 0.5 \\ \text{false} & \text{if } g(\theta^T x) < 0.5 \end{cases}    (1.42)

Since g(z) ≥ 0.5 when z ≥ 0, we can reformulate the preceding expression as

\text{prediction} = \begin{cases} \text{true} & \text{if } \theta^T x \ge 0 \\ \text{false} & \text{if } \theta^T x < 0 \end{cases}    (1.43)
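Equations 1.39 through 1.43 translate almost literally into code. A minimal sketch (the function names here are my own, not from any library):

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^-z) of eq. 1.39."""
    return 1.0 / (1.0 + np.exp(-z))

def predict(theta, x):
    """True when theta·x >= 0, i.e. when g(theta·x) >= 0.5
    (eqs. 1.42 and 1.43)."""
    return sigmoid(np.dot(theta, x)) >= 0.5
```

Note that the two formulations are equivalent: thresholding the sigmoid at 0.5 is the same as thresholding the raw linear model at 0.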
The idea behind logistic regression is to draw a decision boundary. If the underlying regression model is a linear model, then imagine drawing a line (or a hyper-surface) across your data. The points on the left will represent the negative values and the points on the right the positive values. If the underlying model is a more complex polynomial model, then the decision boundary can be more sophisticated in modelling non-linear relations across the data points.
69
Let's look at an example: consider the dataset under /datasets/spam-email-1.csv. It contains a dataset for spam email detection, and each row contains the metadata associated to an email.

1. The first column, blacklist_word_count, reports how many words in the email match a blacklist of words often associated to spam emails.

2. The second column, sender_spam_score, is a score between 0 and 1 assigned by a spam filter that represents the probability that the email is spam on the basis of the sender's email address, domain, or internal domain policy.

3. The third column, is_spam, is 0 if the email was not spam and 1 if it was.

We can plot the dataset to see if there is any correlation between the metrics. We will plot blacklist_word_count on the x axis and sender_spam_score on the y axis and represent the associated dot in red if it's spam and in blue if it's not spam:
Figure 1-14. Plot of the spam email dataset
import pandas as pd
import matplotlib.pyplot as plt

csv_url = ('https://raw.githubusercontent.com/BlackLight'
           '/mlbook-code/master/datasets/spam-email-1.csv')
data = pd.read_csv(csv_url)

# Split spam and non-spam rows
spam = data[data['is_spam'] == 1]
non_spam = data[data['is_spam'] == 0]
columns = data.keys()

fig = plt.figure()
plot = fig.add_subplot()
plot.set_xlabel(columns[0])
plot.set_ylabel(columns[1])

# Plot the non-spam data points in blue
plot.scatter(non_spam[columns[0]], non_spam[columns[1]], c='b')

# Plot the spam data points in red
plot.scatter(spam[columns[0]], spam[columns[1]], c='r')
You should see a graph like the one shown in Figure 1-14. We can visually see that we can approximately draw a line on the plane to split spam from non-spam. Depending on the slope of the line we pick, we may let a few cases slip through, but a good separation line should be accurate enough to make good predictions in most cases. The task of logistic regression is to find the parameters θ that we can plug into Equation 1.40 to get a good prediction model.
Figure 1-15. Example plots of the logistic regression cost function J (q ) with one input variable
1.8.1 Cost function

Most of the principles we have seen in linear regression (definition of a linear or polynomial model, cost/loss function, gradient descent, feature normalization, etc.) also apply to logistic regression. The main difference consists in how we write the cost function J(θ). While the mean squared error is a metric that makes sense when you want to calculate the mean error between your predicted price and the actual price, it doesn't make much sense when you want to find out whether you predicted the correct class of a data point, out of a given discrete number of classes.
We want to build a cost function that expresses the classification error (i.e., whether the predicted class was correct or wrong) and how "large" the classification error is, that is, how "certain/uncertain" about its classification the model was. Let's rewrite the cost function J(θ) we have defined earlier by calling C(hθ(x), y) its new argument:

J(\theta) = \frac{1}{m} \sum_{i=1}^{m} C(h_\theta(x_i), y_i)    (1.44)

When it comes to logistic regression, the function C is often expressed in this form (we will stick to a binary classification problem for now, i.e., a problem with only two output classes, like true/false; we will expand it to multiple classes later):

C(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}    (1.45)
The intuition is as follows:

1. If y⁽ⁱ⁾ = 1 (the class of the i-th data point is positive) and the predicted value hθ(x) also equals 1, then the cost will be zero (−log(1) = 0, i.e., there was no prediction error). The cost will gradually increase as the predicted value gets farther from 1. If y⁽ⁱ⁾ = 1 and hθ(x) = 0, then the cost function will assume an infinite value (−log(0) = ∞). In real applications, of course, we won't use infinity, but we may use a very large number instead. The case where the real value is 1 and the predicted value is 0 is like a case where
a model predicts with 100% confidence that it's raining outside while instead it's not raining: if that happens, you want to bring the model back on track by applying a large cost.

2. Similarly, if y⁽ⁱ⁾ = 0 and the predicted value also equals 0, then the cost function will be null (−log(1 − 0) = 0). If instead y⁽ⁱ⁾ = 0 and hθ(x) = 1, then the cost will go to infinity.

In the case of one input and one output variable, the plot of the cost function will look like Figure 1-15. If we combine the expressions in Equation 1.45, we can rewrite the logistic regression cost function in Equation 1.44 as

J(\theta) = -\frac{1}{m} \left[ \sum_{i=1}^{m} y^{(i)} \log\left(h_\theta(x^{(i)})\right) + \left(1 - y^{(i)}\right) \log\left(1 - h_\theta(x^{(i)})\right) \right]    (1.46)
We have compacted the two expressions of Equation 1.45 together. If y⁽ⁱ⁾ = 1, then the first term of the sum in square brackets applies, and if y⁽ⁱ⁾ = 0, then the second applies. Just like in linear regression, the problem of finding the optimal values of θ is a problem of minimizing the cost function, that is, performing gradient descent. We can therefore still apply the gradient update steps shown in Equation 1.9 to get the direction to the bottom of the cost function surface. Additionally, just like in linear regression, we are leveraging a convex cost function, that is, a cost function with a single null-gradient point that also happens to be the global minimum. By replacing hθ(x) in the preceding formula with the logistic function defined in Equation 1.40 and solving the partial derivatives in Equation 1.9, we can derive the update step for θ at the (k+1)-th step of logistic regression:

\theta_j^{(k+1)} = \theta_j^{(k)} - \frac{\alpha}{m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right) x_j^{(i)} \quad \text{for } j = 0 \ldots n    (1.47)
You’ll notice that the formulation of the θ-update step is basically the same as we saw for the linear regression in Equation 1.22, even though we’ve come to it through a different route. And we should have probably expected it, since our problem is still a problem of finding a line that fits our data in some way. The only difference is that the hypothesis function hθ in the linear case is a linear combination of the q parameters and the input vector ( q T x ), while in the logistic regression case, it’s the sigmoid function we have introduced in Equation 1.40.
1.8.2 Building the regression model from scratch

We can now extend the gradient descent algorithm we have previously seen for the linear case to work for logistic regression. Let's actually put together all the pieces we have analyzed so far (hypothesis function, cost function, and gradient descent) to build a small framework for regression problems. In most cases, you won't have to build a regression algorithm from scratch, but it's a good way to see how the concepts we have covered so far work in practice. First, let's define the hypothesis function for logistic regression:

import math
import numpy as np

def h(theta):
    """
    Return the hypothesis function associated to the
    parameters theta
    """
    def _model(x):
        """
        Return the hypothesis for an input vector x given
        the parameters theta. Note that we use numpy.dot
        here as a more compact way to represent the scalar
        product theta*x
        """
        ret = 1. / (1 + math.exp(-np.dot(theta, x)))

        # Return True if the hypothesis is >= 0.5,
        # otherwise return False
        return ret >= 0.5

    return _model
Note that if we replace the preceding hypothesis function with the scalar product θᵀx, we convert a logistic regression problem back into a linear regression problem:

import numpy as np

def h(theta):
    def _model(x):
        return np.dot(theta, x)

    return _model
Then let's code the gradient descent algorithm:

def gradient_descent(x, y, theta, alpha):
    """
    Perform the gradient descent.

    :param x: Array of input vectors
    :param y: Output vector
    :param theta: Values for theta
    :param alpha: Learning rate
    """
    # Number of samples
    m = len(x)

    # Number of features+1
    n = len(theta)
    new_theta = np.zeros(n)

    # Perform the gradient descent on theta
    for j in range(n):
        for i in range(m):
            new_theta[j] += (h(theta)(x[i]) - y[i]) * x[i][j]
        new_theta[j] = theta[j] - (alpha/m) * new_theta[j]

    return new_theta
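The double loop above is easy to read but slow on large datasets. As a sketch, the same θ-update step of Equation 1.47 (shown here for the linear hypothesis, with a made-up toy dataset) can be written in vectorized form:

```python
import numpy as np

def gradient_descent_vectorized(x, y, theta, alpha):
    """One vectorized theta-update step for a linear hypothesis."""
    preds = x @ theta                  # hypotheses for all samples at once
    grad = x.T @ (preds - y) / len(x)  # average gradient over the samples
    return theta - alpha * grad

# Toy dataset exactly on the line y = 2*x (first column is the x0=1 bias)
x = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
theta = np.zeros(2)

for _ in range(1000):
    theta = gradient_descent_vectorized(x, y, theta, alpha=0.1)
```

After enough iterations, theta converges to approximately [0, 2], the intercept and slope of the toy data.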
Then a train method, which consists of epochs gradient descent iterations:

def train(x, y, epochs, alpha=0.001):
    """
    Train a model on the specified dataset.

    :param x: Array of normalized input vectors
    :param y: Normalized output vector
    :param epochs: Number of training iterations
    :param alpha: Learning rate
    """
    # Set x0=1
    new_x = np.ones((x.shape[0], x.shape[1]+1))
    new_x[:, 1:] = x
    x = new_x

    # Initialize theta randomly
    theta = np.random.randint(low=0, high=10, size=len(x[0]))

    # Perform the gradient descent epochs times
    for i in range(epochs):
        theta = gradient_descent(x, y, theta, alpha)

    # Return the hypothesis function associated to
    # the parameters theta
    return h(theta)
Finally, a prediction function that, given an input vector, the stats of the dataset, and the model:

1. Normalizes the input vector
2. Sets x0 = 1
3. Returns the prediction according to the given hypothesis hθ

def normalize(x, stats):
    return (x - stats['mean']) / stats['std']

def denormalize(x, stats):
    return stats['std'] * x + stats['mean']

def predict(x, stats, model):
    """
    Make a prediction given a model and an input vector
    """
    # Normalize the values
    x = normalize(x, stats).values

    # Set x0=1
    x = np.insert(x, 0, 1.)

    # Get the prediction
    return model(x)
Last, an evaluate function that, given a list of input values and the expected outputs, evaluates the accuracy (number of correctly classified inputs divided by the total number of inputs) of the given hypothesis function:

def evaluate(x, y, stats, model):
    """
    Evaluate the accuracy of a model.

    :param x: Array of input vectors
    :param y: Vector of expected outputs
    :param stats: Input normalization stats
    :param model: Hypothesis function
    """
    n_samples = len(x)
    ok_samples = 0

    for i, row in x.iterrows():
        expected = y[i]
        predicted = predict(row, stats, model)

        if expected == predicted:
            ok_samples += 1

    return ok_samples / n_samples
Now, let's give this framework a try by training and evaluating a model for spam detection on the dataset that we have previously loaded:

columns = data.keys()

# x contains the input features (first two columns)
inputs = data.loc[:, columns[:2]]

# y contains the output features (last column)
outputs = data.loc[:, columns[2]]

# Get the statistics for inputs and outputs
x_stats = inputs.describe().transpose()
y_stats = outputs.describe().transpose()

# Normalize the features
norm_x = normalize(inputs, x_stats)
norm_y = normalize(outputs, y_stats)

# Train a classifier on the normalized data
spam_classifier = train(norm_x, norm_y, epochs=100)

# Evaluate the accuracy of the classifier
accuracy = evaluate(inputs, outputs, x_stats, spam_classifier)
print(accuracy)

Hopefully you should measure a >85% accuracy, which isn't bad if we look back at how the original data is distributed and consider that we defined a linear decision boundary.
1.8.3 The TensorFlow way

Now that we have learned all the nuts and bolts of a regression model, let's build a logistic regression model to solve our spam classification problem with TensorFlow. Only a few tweaks are required to the example we have previously seen for linear regression:

from tensorflow.keras.experimental import LinearModel

columns = data.keys()

# Input features are on the first two columns
inputs = data.loc[:, columns[:2]]

# Output feature is on the last column
outputs = data.loc[:, columns[2:]]

# Normalize the inputs
x_stats = inputs.describe().transpose()
norm_x = normalize(inputs, x_stats)

# Define and compile the model
model = LinearModel(2, activation='sigmoid', dtype='float32')
model.compile(optimizer='sgd',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy', 'sparse_categorical_crossentropy'])

# Train the model
history = model.fit(norm_x, outputs, epochs=700, verbose=0)
A few changes in the preceding code compared to the linear model:

• We use the sigmoid activation function, as defined in Equation 1.39, instead of linear.
• We use the sgd optimizer (stochastic gradient descent).
• We use a categorical cross-entropy loss function, similar to the one defined in Equation 1.46.
• We use accuracy (i.e., the number of correctly classified samples divided by the total number of samples) as a performance metric.
Figure 1-16. Loss function of the logistic model over training epochs

Let's see how the loss function progressed over the training iterations:

epochs = history.epoch
loss = history.history['loss']
fig = plt.figure()
plot = fig.add_subplot()
plot.set_xlabel('epoch')
plot.set_ylabel('loss')
plot.plot(epochs, loss)

You should see a graph like the one shown in Figure 1-16. All the other model methods we saw for the linear regression case (evaluate, predict, save, and load) will also work for the logistic regression case.
1.8.4 Multiclass regression

We have so far covered logistic regression on multiple input features but one single output class: true or false. Extending logistic regression to work with multiple output classes is a quite intuitive process called one vs. all. Suppose that in the case we saw earlier, instead of a binary spam/non-spam classification, we had three possible classes: normal, important, and spam. The idea is to break down the original classification problem into three binary logistic regression hypothesis functions hθ, one per class:

1. A function hθ⁽¹⁾(x) to model the decision boundary normal/not normal
2. A function hθ⁽²⁾(x) to model the decision boundary important/not important
3. A function hθ⁽³⁾(x) to model the decision boundary spam/not spam
Figure 1-17. Plot of a dataset with a non-linear decision boundary

Each hypothesis function represents the probability that an input vector belongs to a specific class; the one with the highest value tells us which class the input belongs to. In other words, given an input x and c classes, we want to pick the class for x that has the highest associated hypothesis function value:

\max_{1 \le i \le c} h_\theta^{(i)}(x)    (1.48)
The intuition is that the hypothesis function with the highest value represents the class with the highest probability, and that’s the prediction we want to pick.
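As a sketch, the one-vs-all decision rule of Equation 1.48 is just an argmax over the hypothesis values. The hypothesis functions below are mocks standing in for three trained logistic regression models:

```python
import numpy as np

def one_vs_all_predict(hypotheses, x):
    """Return the index of the class whose hypothesis value is highest."""
    return int(np.argmax([h(x) for h in hypotheses]))

# Mock hypotheses: in practice these would be the three trained
# models h^(1), h^(2), h^(3), one per class
classes = ['normal', 'important', 'spam']
hypotheses = [lambda x: 0.2, lambda x: 0.7, lambda x: 0.1]
label = classes[one_vs_all_predict(hypotheses, x=None)]
```

With the mock values above, the "important" hypothesis wins, so that is the predicted class.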
1.8.5 Non-linear boundaries

So far we have explored logistic regression models where the argument of the sigmoid is a linear function. However, just like in linear regression, not all the classification problems out there can be modelled by drawing linear decision boundaries. The distribution of the dataset in Figure 1-17 is definitely not a good fit for a linear decision boundary, but an elliptic decision boundary could do the job quite well. Just like we saw in linear regression, a non-linear model function can effectively be translated into a
linear multivariate function through variable substitution, and we can then minimize that function in order to get the values of θ. In the case above, a good hypothesis function could be something like this:

h_\theta(x) = g\left( \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1 x_2 + \theta_4 x_1^2 + \theta_5 x_2^2 \right)    (1.49)
We can then apply the variable substitution procedure we have already seen to replace the higher polynomial terms with new variables and proceed with solving the associated multivariate regression problem with a linear model to get the values of θ. Just like in linear regression, there is no limit to the number of additional polynomial terms you can add to your model to express more complex decision boundaries. Take into account, anyway, that adding too many polynomial terms could end up modelling decision boundaries that overfit your data. As we will see later, another common approach for detecting non-linear boundaries is to connect multiple logistic regression units together, that is, to build a neural network.
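As a sketch, the variable substitution for the elliptic boundary of Equation 1.49 simply maps each 2D point to the expanded feature vector, which can then be fed to the linear logistic regression machinery seen earlier:

```python
import numpy as np

def elliptic_features(x1, x2):
    """Map (x1, x2) to [x0=1, x1, x2, x1*x2, x1^2, x2^2] (eq. 1.49)."""
    return np.array([1.0, x1, x2, x1 * x2, x1 ** 2, x2 ** 2])
```

The θ vector fitted against these six synthetic features, plugged into the sigmoid, draws an elliptic decision boundary in the original (x1, x2) plane.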
CHAPTER 2

Neural Networks

© Fabio Manganiello 2021
F. Manganiello, Computer Vision with Maker Tech, https://doi.org/10.1007/978-1-4842-6821-6_2

Now that we have an idea of how to use regression to train a model, it's time to explore the next step: fitting multiple regression units into a neural network. Neural networks come as a better approach to solve complex regression or classification problems with non-linear dependencies between the input features. We have seen that linear regression and logistic regression with linear functions can perform well if the features of the underlying data are bound by linear relations. In that case, fitting a line (or a linear hyper-surface) through your data is a good strategy to train a model. We have also seen that more complex non-linear relations can be expressed by adding more polynomial terms to the hypothesis function, for instance, quadratic or cubic relations or terms with the product of some features. There's a limit, however, to adding more synthetic features as extra polynomial terms. Consider the example shown in Figure 2-1. A linear division boundary may not be that accurate in describing the relation between the input features. We can add more polynomial terms (e.g., x1x2, x1², or x1x2²), and that could indeed fit our data well, but we run the risk of overfitting the dataset (i.e., producing a model that performs well on the specific training input but is not generic enough for other cases). Moreover, the approach may still be sustainable if we only have two input features. Try, however, to picture a real-case scenario like the price of a house, which can depend on a vast amount of features that are not
necessarily linearly related, and you'll notice that the number of additional polynomial features required by your model will easily explode. Training a regression model with multiple input features and multiple non-linear terms has two big drawbacks:

• The relations between the features are hard to visualize.
• It's an approach prone to combinatorial explosion: you'll likely end up with a lot of features to express all the relations. Such a model will be very expensive to train while still prone to overfitting.
And things only get trickier when we move to the domain of image detection. Keep in mind that a computer only sees the raw pixel values in an image, and objects that are pictured in an image often have non-linear boundaries.
Figure 2-1. Example of non-linear boundary between two variables that can be tricky to express with logistic regression alone When your data consists of many input features and when the distribution of your data or the boundaries between its classes are nonlinear, it’s usually a better idea to organize your regression classifiers into a network to better capture the increased complexity, instead of attempting to build a single regression classifier that maybe comes with a lot of 88
CHAPTER 2
NEURAL NETWORKS
polynomial terms to best describe your training set but can easily end up overfitting your data without actually providing good predictions. The idea is similar to the way most of the animals (and humans) learn. The neurons in our system are strongly wired together through a structure akin to the one shown in Figure 2-2, and they continuously interact through small variations of electric potential both with the periphery of the body (on fibers called axons, each connected to a neuron, if you bundle more axons together you get a nerve) and with other neurons (on connection fibers called dendrites). The connections between the axon from a neuron and the dendrites of the next neuron are called synapses. The electrochemical signals sent over these connections are what allows animals to see, hear, smell, or feel pain and what allows us to consciously move an arm or a leg. These connections, however, constantly mutate over a lifetime span. Connections between neurons are forged on the basis of the sensory signals that are gathered from the environment around and on the basis of the experience we collect. We don’t innately know how to grasp an object the right way, but we gradually learn it within the first months of our lives. We don’t innately know how to speak the language of our parents, but we gradually learn it as we are exposed to more and more examples. Physically, this happens by a continuous process of fine-tuning of the connections between neurons in order to optimize how we perform a certain task. Neurons quickly specialize to perform a certain task—those in the back of our head process the images from our eyes, while those in the pre-frontal cortex are usually in charge for abstract thought and planning—by synchronizing with their connected neighbors. 
Neurons quickly learn to fire electrochemical impulses or stay silent depending on whether the net input signals are above or below a certain threshold, and connections are continuously re-modelled, according to the idea that neurons that fire together wire together more strongly. The nervous system is in charge not only of creating new connections to better react to the environment but also of keeping the number of connections optimized. If each neuron were strongly connected to all other neurons, our bodies would require a lot of energy just to keep all the connections running, so neural paths that aren't used for a certain period of time eventually lose their strength; that's why, for example, we tend to forget notions that we haven't refreshed for a long time.
Figure 2-2. Principal components of a physical neuron

Artificial neural networks are modelled in a way that closely mimics their biological counterparts. Not only that, but the fields of artificial intelligence and neuroscience have a long-standing tradition of influencing each other: artificial neural networks are modelled after the biological networks, while progress in the development of artificial neural networks often sheds light on overlooked features of the brain.

Imagine each neuron in an artificial neural network as a computational unit with a certain number of inputs, which approximately map the dendrites of the physical cell. Each input "wire" has a weight that can be adjusted during the learning phase, which approximately maps the synapses of the physical neuron. The neuron itself can be seen as a computational unit that performs the weighted sum of its inputs and applies a non-linear function (like the logistic curve we have previously seen) to map an on/off output state. If the output of the activation function is higher than a certain threshold, then the neuron will "fire" a signal. The output of each neuron can either be fed to another neuron or be one of the terminal outputs of your network. The simplest case of a network is the perceptron (see Figure 2-3), similar to what Frank Rosenblatt designed in 1957 to try and recognize people in images. It's a network with a single neuron with a set of n + 1 input features x = [x_0, x_1, x_2, …, x_n] (just like in the case of regression, we are using an accessory x_0 = 1 input to express the linear model in a more compact way). Each of the inputs is connected to the neuron through a vector of n + 1 weights θ = [θ_0, θ_1, θ_2, …, θ_n]. The unit will output an activation value a(x), defined as the weighted sum of its inputs:

	a(x) = Σ_{i=0}^{n} θ_i x_i		(2.1)
Such an activation value will go through an activation function h_θ(x), usually the sigmoid/logistic function we saw in Equation 1.39, that will map the real value onto a discrete on/off value:

	g(z) = 1 / (1 + e^{-z})		(2.2)
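As a quick sketch, Equations 2.1 and 2.2 can be combined into a single perceptron unit in a few lines of Python. The weights below are made-up values, chosen so that the unit behaves like a logical AND of its two inputs:

```python
import numpy as np

def sigmoid(z):
    # Logistic activation function, Equation 2.2: g(z) = 1 / (1 + e^-z)
    return 1.0 / (1.0 + np.exp(-z))

def perceptron(theta, x):
    # Weighted sum of the inputs (Equation 2.1) passed through
    # the sigmoid. x[0] is expected to be the bias input x_0 = 1.
    return sigmoid(np.dot(theta, x))

# Hypothetical weights: the unit "fires" (output > 0.5) only
# when both of its "real" inputs are on, like a logical AND
theta = np.array([-1.5, 1.0, 1.0])
print(perceptron(theta, np.array([1.0, 1.0, 1.0])))  # > 0.5, fires
print(perceptron(theta, np.array([1.0, 1.0, 0.0])))  # < 0.5, stays silent
```

With a decision threshold at 0.5, such a single unit can model linearly separable functions like AND or OR, but, as discussed next, not much more.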
Figure 2-3. Logic model of a network with a single neuron (perceptron)

Such a learning unit can work for some simple cases, but in more complex cases, it will face the same issues encountered by Rosenblatt's first perceptron. Spotting people or objects in images is a learning problem with a high number of inputs: in its most naive implementation, each pixel of the image is an input of the model. A single neuron isn't sufficient in general to model such a high level of variability. That's why nowadays it's much more common to pack multiple perceptron units into a neural network, where the output of each neuron is connected to the input of a neuron in the next layer, as shown in Figure 2-4. By packing more interconnected neurons, a learning unit usually becomes better at recognizing complex patterns in problems with a high number of dimensions. The example in the figure shows a neural network with three layers. This is a quite common architecture for simple cases, but we'll soon see that problems with more nuanced patterns can be better solved by networks with more intermediate layers. By convention, the first layer is called the input layer, and it usually has as many neurons as the dimensions of the input dataset (plus one, with x_0 = 1). The second and any other intermediate layer is usually called a hidden layer: these layers do most of the inference work, but they are directly connected neither to the inputs nor to the outputs of the network. The last layer of the network is called the output layer, and it contains as many outputs as the output classes/labels (one true/false class in the case in the figure, but we'll soon see networks with a higher number of outputs as well). Note that the notation a_i^(j) denotes the activation value of the i-th unit in the j-th layer of the network, while h_Θ(x) denotes the hypothesis (or prediction) of the model as a function of the configured parameters Θ, calculated as

	h_Θ(x) = 1 / (1 + e^{-Θx})

While in the regression case the parameters (or weights) of the model were a vector, in this case each j-th layer will have an associated Θ^(j) matrix to map the weights between the j-th and the (j+1)-th layer. Remember that each of the units in j is connected to each of the units in j + 1, so Θ^(j) will be an m × n matrix, where m is the number of units in j and n is the number of units in j + 1. Therefore, we can visualize Θ as a 3D tensor. Intuitively, a tensor is a multi-dimensional generalization of a matrix. In our case, each j-th "slice" of the tensor Θ represents the 2D matrix of weights of the j-th layer. If we put together all the pieces of information we've gathered so far, we can formalize the activation function of each neuron in the figure as follows:
Figure 2-4. Logic model of an artificial neural network with three layers

•	The units in the first layer are usually mapped one-to-one to the input features, unless you want to assign each feature a different weight:

	a_i^(1) = x_i   for i = 0 … n
•	The activation values of the units in the second layer are the logistic function of the weighted sum of the inputs:

	a_1^(2) = g(Θ_10^(1) x_0 + Θ_11^(1) x_1 + Θ_12^(1) x_2 + … + Θ_1n^(1) x_n)
	a_2^(2) = g(Θ_20^(1) x_0 + Θ_21^(1) x_1 + Θ_22^(1) x_2 + … + Θ_2n^(1) x_n)
	…
	a_n^(2) = g(Θ_n0^(1) x_0 + Θ_n1^(1) x_1 + Θ_n2^(1) x_2 + … + Θ_nn^(1) x_n)

•	The activation value of the unit in the last layer is the logistic function of the weighted sum of the outputs of the units from the previous layer:

	a_1^(3) = h_Θ(x) = g(Θ_10^(2) a_0^(2) + Θ_11^(2) a_1^(2) + Θ_12^(2) a_2^(2) + … + Θ_1n^(2) a_n^(2))
We can generalize the formula for the activation function of the i-th unit in the j-th layer as follows:

	a_i^(j) = g( Σ_{k=0}^{n} Θ_ik^(j-1) a_k^(j-1) )		(2.3)

Or, using a vectorial notation, we can describe the vector of activation values of the units in the j-th layer as

	a^(j) = g( Θ^(j-1)T a^(j-1) )		(2.4)
This algorithm is commonly known as forward-propagation, and it's how a neural network with a certain set of weights makes predictions given some input data; the intuition is basically to propagate the input values through each node, from the input to the output layer. Forward-propagation can be seen as a generalization of the hypothesis function used in logistic regression to a multi-unit scenario. A couple of observations:

•	Keep in mind that g is the logistic function, and we use it on each layer to "discretize" the weighted sum of the inputs.

•	We have so far set x_0 = 1, so that each of the input vectors actually has size n + 1. We keep this practice also for neural networks: each j-th layer will have its own bias unit a_0^(j). We can assume that these bias values always equal one for now, but we will see later on how to tune the bias vector to improve the performance of our models.

•	In the example we've considered so far, both the input and hidden layers have n + 1 units. While the input layer has indeed n + 1 units in most cases, where n is the dimension of the inputs, and the output layer mostly has as many units as the number of output classes, there is no constraint on the number of units in the hidden layer. Actually, it's a common practice in many cases to have more units (at least in the first hidden layer) than the number of input dimensions, but keep an eye on the performance of your model to make sure that you don't keep adding hidden units beyond the point where they no longer improve your performance metrics.

•	As I mentioned earlier, there's no constraint on the number of hidden layers either. Indeed, adding more hidden layers will usually improve the performance of your model, as your network will be able to detect more nuanced patterns. However, just like the good practice for the number of units, you may also not want to overengineer your model by adding more layers than those actually required by the type of classification problem you want to solve, since adding either too many intermediate layers or units may have no measurable impact on the network in the best case and deteriorate its performance because of overfit in the worst case. Again, the best practice is to try different numbers of layers with different numbers of units and see when you hit the "sweet spot" between good model performance metrics and good system performance metrics.
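The forward-propagation loop described above can be sketched in a few lines of NumPy. The layer sizes and the random weights below are arbitrary, and the layout assumed here (one weight matrix per layer transition, with an extra column for the bias unit) is just one possible way to arrange Θ:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagation(thetas, x):
    # thetas[l] maps layer l+1 to layer l+2; each matrix has one
    # extra column for the bias unit a_0 = 1 prepended at each step
    a = np.asarray(x, dtype=float)
    for theta in thetas:
        a = np.insert(a, 0, 1.0)   # prepend the bias unit
        a = sigmoid(theta @ a)     # Equation 2.4
    return a

# Hypothetical 3-layer network: 2 inputs, 3 hidden units, 1 output
rng = np.random.default_rng(42)
thetas = [rng.normal(size=(3, 3)),   # (2 inputs + bias) -> 3 hidden units
          rng.normal(size=(1, 4))]   # (3 hidden + bias) -> 1 output unit
h = forward_propagation(thetas, [0.5, -1.0])
print(h)  # a single activation value between 0 and 1
```

Whatever the network shape, the output is always the sigmoid of a weighted sum, so it stays in the (0, 1) range and can be read as a probability.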
2.1 Back-propagation

If forward-propagation is the way a neural network makes predictions, back-propagation is the way a neural network "learns", that is, how it adjusts its tensor of weights Θ given a training set. Just like we did for regression, also for neural networks the learning phase can be derived by defining a cost function that we want to optimize.

In the general case, you will have a neural network with K outputs, where those outputs are the classes that you want to detect. Each output expresses the probability, between 0 and 1, that a certain input belongs to that class. If the inputs of your network are pictures of items of clothing, for example, and you want to detect whether a picture contains a shirt, a skirt, or a pair of trousers, you may want to model a neural network with three output units. If a picture contains a shirt, for instance, you want that network to output something close to [1, 0, 0]. If it's a pair of trousers, you want it to output something like [0, 0, 1], and so on. So instead of a hypothesis function with a single output variable, like we have seen so far, you will have a hypothesis function h_Θ(x) ∈ R^K with a vector of K values, one for each class. The prediction class of your model will usually be the index of the h_Θ(x) component with the highest value:

	class = argmax_i h_Θ^(i)(x)

In the previous example, if you get a hypothesis vector like [0.8, 0.2, 0.1] for a certain picture, and your output units are set in the order [shirt, skirt, trousers], then it's likely that that picture contains a shirt. So the job of the cost function of a neural network is to minimize the classification error between the i-th vector of labels in the training set, y^(i), and the predicted output vector h_Θ(x). We have seen this type of cost function already in the case of a true/false binary classification problem in Equation 1.46, when we covered logistic regression:

	J(θ) = -(1/m) Σ_{i=1}^{m} [ y^(i) log(h_θ(x^(i))) + (1 - y^(i)) log(1 - h_θ(x^(i))) ]		(2.5)
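Picking the predicted class as the argmax of the hypothesis vector can be sketched as follows; the class names and scores here are made up for illustration:

```python
import numpy as np

classes = ["shirt", "skirt", "trousers"]

# Hypothetical output of a trained network for one picture
h = np.array([0.8, 0.2, 0.1])

# class = argmax_i h_i(x): pick the output unit with the highest value
predicted = classes[int(np.argmax(h))]
print(predicted)  # shirt
```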
If instead of one single label y(i) for the i-th input sample we have a vector of K items, and instead of a one-dimensional vector of weights θ we have a 3D tensor of weights Θ, with each slice representing the weight matrix to map one layer to the next, then the cost function can be rewritten as follows:
	J(Θ) = -(1/m) Σ_{i=1}^{m} Σ_{k=1}^{K} [ y_k^(i) log(h_Θ(x^(i)))_k + (1 - y_k^(i)) log(1 - (h_Θ(x^(i)))_k) ] + (λ/2m) Σ_{l=1}^{L-1} Σ_{i=1}^{s_l} Σ_{j=1}^{s_{l+1}} (Θ_ij^(l))^2		(2.6)
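A sketch of Equation 2.6 in NumPy. All values here are toy data (`predictions` would normally come from forward-propagation), and for simplicity the regularization term below sums over all weights:

```python
import numpy as np

def cost(predictions, labels, thetas, lambda_=0.1):
    # Cross-entropy term of Equation 2.6, averaged over the m samples;
    # predictions and labels have shape (m, K)
    m = len(labels)
    eps = 1e-12  # avoid log(0)
    cross_entropy = -np.sum(
        labels * np.log(predictions + eps)
        + (1 - labels) * np.log(1 - predictions + eps)
    ) / m
    # Regularization term: lambda/(2m) times the sum of squared weights
    reg = lambda_ / (2 * m) * sum(np.sum(t ** 2) for t in thetas)
    return cross_entropy + reg

preds = np.array([[0.8, 0.1, 0.1],
                  [0.2, 0.7, 0.1]])
labels = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
thetas = [np.ones((3, 4))]
print(cost(preds, labels, thetas))  # a small positive number
```

Note how the cost shrinks as the predictions approach the labels, and grows with larger weights when λ > 0.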
The second term in the sum (multiplied by λ/2m) is the conventional way to encode the bias inputs of each layer (the weights of the x_0^(j) elements), as the sum of the squares of the weights between the l-th and (l+1)-th layer. λ is the bias rate, or regularization rate, of the network, and it defines the inertia of the network against changes: a high value leads to a more conservative model, that is, a model that will be slower to apply corrections to its weights, while a low value leads to a model that will adapt faster to changes, at the risk of overfitting the data. Training phases usually start with a lower bias rate in order to quickly adjust to corrections at the beginning, and the rate is then slowly adjusted over time.

Just like in the case of regression, finding the optimal values of Θ is a problem of minimizing the preceding cost function; therefore, we perform some form of gradient descent to find its minimum. In neural networks, this process is usually done layer by layer, starting from the output layer and adjusting the weights backward layer by layer (that's why it's called back-propagation). The intuition is to first compare the outputs of the network against the expected samples, and as we adjust the weights of the units in the output layer to match the output more closely, we calculate new intermediate expected results for the units in the previous layer, and we proceed with adjusting the weights until we get to the first layer.

Let's consider a network with L = 4 layers and two outputs like the one shown in Figure 2-5. If we present it with an input x = [x_1, x_2, x_3] and a vector of expected labels y = [y_1, y_2] and apply forward-propagation to it, we can calculate its hypothesis function h_Θ(x^(i)) ∈ R^2:

	a^(1) = x
	a^(2) = g( Θ^(1)T a^(1) )
	a^(3) = g( Θ^(2)T a^(2) )
	a^(4) = h_Θ(x) = g( Θ^(3)T a^(3) )
Then, just like in the case of regression, we want to find the values of Θ (which is a 3D tensor in this case) that minimize the cost function J(Θ). In other words, for each layer l, we want to calculate its gradient vector ∇J(Θ^(l)). We want the values of this vector to be as close as possible to zero:

	∇J(Θ^(l)) = 0   for all 1 ≤ l ≤ L - 1

It means that the partial derivatives of J(Θ^(l)) with respect to its weights Θ_ij^(l) should be set to zero, so we can derive the optimal weights out of the resulting equations:

	∂J(Θ^(l)) / ∂Θ_ij^(l) = 0,   1 ≤ i ≤ s_l,   1 ≤ j ≤ s_{l+1}

Figure 2-5. Example of a neural network with L = 4 layers, n = 3 inputs, and K = 2 outputs
We then define a quantity δ_j^(l) as the error coefficient of the j-th unit in the l-th layer, starting from the last layer, where the coefficient is defined as

	δ_j^(4) = a_j^(4) - y_j   for j = 1, 2

Or, in vectorial form:

	δ^(4) = a^(4) - y

Taking into account the error on the output layer, we calculate the correction to the weights that connect those units to the units in the previous layers in the following way:

	δ^(3) = ( Θ^(3) )^T δ^(4) ⊙ ∇g( Θ^(2)T a^(2) )

There are quite a few things happening here, so let's dig in term by term:
•	Θ^(3) is the matrix containing the weights that connect the second to the third layer of the network. If, like in Figure 2-5, the third layer has three units and the second layer has four units, then Θ^(3) is a 3 × 4 matrix.

•	We perform a matrix-vector product between the transposed matrix of the weights in a layer and the vector of δ correction coefficients calculated at the following layer. The result is a vector that contains as many elements as the number of units in the layer.

•	We then perform an element-wise product (or Hadamard product, denoted by ⊙) between that vector and the vector of partial derivatives of the activation function of the units in the layer (using the notation we have seen in Equation 2.4). The element-wise product is intuitively the element-by-element product between two vectors with the same size, for example:

	[1, 2, 3] ⊙ [4, 5, 6] = [4, 10, 18]

•	∇g is the gradient vector of the activation function (usually the sigmoid function) calculated for each unit.
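Putting the pieces above together, the δ correction for one layer of a network like the one in Figure 2-5 might look like this in NumPy. All the activation values, weights, and shapes below are made up for illustration, and the sigmoid gradient is expressed through the identity ∇g = a ⊙ (1 - a):

```python
import numpy as np

# Hypothetical output activations and expected labels (K = 2 outputs)
a4 = np.array([0.9, 0.3])
y = np.array([1.0, 0.0])
delta4 = a4 - y                    # error on the output layer

# Made-up weights into the output layer (2 output units, 3 source units)
theta3 = np.array([[0.1, 0.4, -0.2],
                   [0.3, -0.1, 0.5]])
a3 = np.array([0.6, 0.2, 0.8])     # third-layer activations

# Matrix-vector product, then Hadamard product with the sigmoid gradient
delta3 = (theta3.T @ delta4) * a3 * (1 - a3)
print(delta3)  # one correction coefficient per unit in the third layer
```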
We can generalize the preceding expression to express the correction coefficients for the units in the l-th layer as

	δ^(l) = ( Θ^(l) )^T δ^(l+1) ⊙ ∇g( Θ^(l-1)T a^(l-1) )

The activation function g(Θ^(l-1)T a^(l-1)), as seen in Equation 2.4, expresses the activation values of the units in the l-th layer, a^(l). By solving the derivatives, we can infer this formulation for δ^(l):

	δ^(l) = ( Θ^(l) )^T δ^(l+1) ⊙ a^(l) ⊙ (1 - a^(l))		(2.7)

It is possible to prove (although the process of derivation is quite lengthy and we'll skip it in this chapter) that the following relation exists between the partial derivatives of the cost function and the coefficients δ (ignoring for now the bias coefficients and the normalization rate λ):

	∂J(Θ) / ∂Θ_ij^(l) = a_j^(l) δ_i^(l+1)		(2.8)
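Equation 2.8 translates into a simple outer product in NumPy: every weight between two layers gets the gradient a_j^(l) · δ_i^(l+1). The activation and δ values below are arbitrary sample data:

```python
import numpy as np

a2 = np.array([1.0, 0.6, 0.2, 0.8])      # layer activations (bias first)
delta3 = np.array([0.02, -0.01, 0.03])   # errors of the following layer

# dJ/dTheta_ij = a_j * delta_i, i.e., the outer product of delta and a
grad = np.outer(delta3, a2)
print(grad.shape)  # one gradient entry per weight between the two layers
```

The resulting matrix has one row per unit in the following layer and one column per unit in the current layer, matching the shape of the weight matrix it corrects.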
Since the partial derivatives of the cost function are exactly what we want to minimize, we can use the findings of Equations 2.7 and 2.8 to define how a network with L layers "learns" through cycles of forward and back-propagation on a training set:

•	Initialize the weights Θ in your model, either randomly or according to some heuristic, and initialize a tensor Δ that contains the partial derivatives of the cost function for each weight, with Δ_ij^(l) = 0 for each connection of the i-th unit in the (l+1)-th layer with the j-th unit in the l-th layer.

•	Iterate over all the items in a normalized training set X = {(x^(1), y^(1)), …, (x^(m), y^(m))}.

•	For each i-th training item, set a^(1) = x^(i): the input units of the network will be initialized with each of the normalized input vectors.
•	Perform forward-propagation to compute the activation values of the units in the next layers, a^(l), with 1