English Pages [234] Year 2020
PYTHON MACHINE LEARNING Machine learning is considered one of the must-have skills for the coming years. As tasks multiply, programming each one becomes time-consuming; machine learning allows machines to learn on their own, by feeding them data, and produce the same results. I assume that you have prior knowledge of Python programming and data science; if you don't, you can check the books on the next page. A complete guidebook for anyone who wants to master machine learning with Python.
Rahul Mula
Machine Learning with Python by Rahul Mula © 2020 Machine Learning with Python. All rights reserved. No portion of this book may be reproduced in any form without permission from the copyright holder, except as permitted by U.S. copyright law. Cover by Rahul Mula. All the programs written in this book are tested and verified by the author. Cover template from freepik.com. ISBN: 979-8-58-426755-1
MAKE SURE TO CHECK THEM OUT
Python for Beginners
A beginner's guide to programming with Python
Data Science with Python
Learn how to perform tasks like data processing, cleansing, analysis and visualization
Why should you learn machine learning? What are its uses? These are the questions that may come to your mind. The answer is simple: imagine you are given data from an online store about its products and asked to recommend products. The data has product name, category, quantity and rate columns, with several hundred rows of products. If you want to perform some analysis, like finding the product purchased most in a day, it would take a lot of time to do manually. To ease such tasks we use data analysis, i.e. we run a program to perform a certain analysis. The computer runs the program and we get the output in just a few seconds. Then we classify the user and suggest products based on preferable categories, on the basis of their previous search results. So, how do we do that? We need to learn data science and machine learning to perform those tasks. Businesses & organizations are trying to deal with it by building intelligent systems using the concepts and methodologies from data science, data mining and machine learning. Among them, machine learning is the most exciting field of computer science. It would not be wrong to call machine learning the application and science of algorithms that provide sense to data. This book is prepared especially for beginners (at data science and machine learning), but you should be familiar with programming in Python. We will work with packages and modules like NumPy, SciPy, Pandas, Matplotlib, Scikit-learn, etc. to perform analysis and other tasks. I kept this book open to the basic concepts of data science to help beginners understand everything, but the book only covers the data science concepts needed for machine learning; as the name suggests, the book is not for you if you're looking for data science, so check the other books page for that. I also included advanced topics, to not limit you to the basics. Machine learning, algorithms, data science, etc. may seem tough and boring at first, but as you handle more and more data, you'll enjoy playing with it!
(contents)

CHAPTER 03: PANDAS
• Features of Pandas Library • Series • Data Frames

CHAPTER 06: MATPLOTLIB
• Features of matplotlib • Data visualization • PyPlot in matplotlib

CHAPTER 07: SCIKIT LEARN
• Features of Scikit-learn library • How to work with data? • Why use Python?

CHAPTER 08: TYPES OF MACHINE LEARNING
• Supervised learning • Unsupervised learning • Deep learning

SCIKIT LEARN ALGORITHMS
• Regression algorithm • Classification algorithm • Clustering algorithm

IMPORTING DATA
• Importing CSV data • Importing JSON data • Importing Excel data

DATA OPERATIONS
• NumPy operations • Pandas operations • Cleaning data

MATHEMATICS FOR MACHINE LEARNING
• Data instances • Statistics • Probability

CHAPTER 14: DATA ANALYSIS & PROCESSING
• Data analytics • Correlations between attributes • Skewness of the data

CHAPTER 16: DATA VISUALIZATION
• Plotting data • Univariate plots • Multivariate plots

CLASSIFICATION
• Decision tree • Linear regression • Naive Bayes

CHAPTER 20: PERFORMANCE & METRICS
• Calculating the model • Improving the model • Saving and loading models
01 MACHINE LEARNING INTRODUCTION • What is Machine Learning? • Uses of Machine Learning • How Machines Learn?
MACHINE LEARNING INTRODUCTION
What is Machine Learning?
Data is what you need to do ANALYTICS; information is what you need to do BUSINESS. Commonly referred to as the "oil of the 21st century", our digital data carries the most importance in the field. It has incalculable benefits in business, research and our everyday lives. Machine Learning is the field of computer science where machines provide meaning to data like we humans do. Machine Learning is a type of artificial intelligence that finds patterns in raw data through various algorithms and makes predictions like humans. Machine Learning also means machines learn on their own. To better understand it, think of a newborn child as a machine learning model. The parents cannot teach them everything, which is why they send them to school; here, you take the role of the school for the machine learning model. The school has textbooks, tests, etc. to help you learn on your own, which in this analogy is the data fed to the model.
Uses of Machine Learning
Organizations are investing heavily in technologies like Artificial Intelligence, Machine Learning and Deep Learning to extract key information from data, perform several real-world tasks and solve problems. We can call these data-driven decisions taken by machines, particularly to automate processes. These data-driven decisions can be used, instead of programming logic, in problems that cannot be programmed inherently. The fact is that we can't do without human intelligence, but the other aspect is that we all need to solve real-world problems with efficiency at a huge scale. That is why the need for machine learning arises. The following are some of its applications in the real world:
• Forecasting tomorrow's weather by finding patterns in the weather data of previous days
• Predicting the future prices of stocks in the stock market
• Suggesting a product to a customer in an online store according to the user's previous search terms
How do machines learn?
So what magic happens that lets machines learn like us and perform tasks? Let's understand that with an example. Say you are a new computer dealer with only basic experience. You ask another dealer and obtain information, summarizing the important points: the processor cores, RAM and GPU. The dealer tells you these points, and you learn from them. Then you are provided with the following data about 8 GB RAM modules of different brands:
[Scatter plot: RAM frequency (MHz) on the x-axis vs. price on the y-axis]
By observing the data we can tell that the price increases as the frequency increases. You understand a simple logic behind the data. Then, if you get a RAM module with a known specification, you can tell its price: for example, a module with a frequency of 1666 MHz is priced at 60.
But what if you get a frequency that you don't have a record of, like 2600 MHz? Then you have to learn how to decide the price. We start to find a way to calculate it from the given data. We assume that there is a linear relationship between the two, and define the relationship as a straight line as shown below:
[Line plot: a straight line fitted through the price vs. RAM frequency data]
Now we can use the line as a reference and predict values: for 2600 MHz the cost will be about 77. So how do we draw the line? We follow the formula cost = a + b * MHz. But what are a and b? They are the parameters of the straight line, which you don't need to sweat about.
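The line-fitting idea above can be sketched in a few lines of Python. The frequency and price numbers here are hypothetical, made up just to illustrate fitting cost = a + b * MHz to data:

```python
import numpy as np

# Hypothetical training data: RAM frequency (MHz) and price
freq = np.array([1333, 1666, 2133, 2400, 3000])
price = np.array([50, 60, 74, 82, 100])

# Fit a straight line price = a + b * MHz by least squares
# (np.polyfit returns coefficients from highest degree down: [b, a])
b, a = np.polyfit(freq, price, deg=1)

# Predict the price of an unseen 2600 MHz module
predicted = a + b * 2600
print(round(predicted, 1))
```

With this made-up data the fitted slope is roughly 0.03 per MHz, so the prediction lands near 88; with the data from the book's figure the same two lines of fitting code would give the "about 77" answer.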
Likewise in machine learning, the machine, i.e. the computer, learns the patterns or relations in the data through algorithms and predicts values when a new value is asked for. So will there be no errors? It will definitely sometimes predict differently from the actual answer. We also make many mistakes, but we learn from them, or change our tutor if the results stay negative. Machine learning models too learn from their mistakes, and we change algorithms when the results do not improve.
02 SETTING-UP ENVIRONMENT • Installing Anaconda • Jupyter notebook • Working with Jupyter notebook
Installing Anaconda
Head to anaconda.com/products/individual to download the latest version of Anaconda.
Anaconda Installers
• Windows: 64-Bit Graphical Installer (466 MB), 32-Bit Graphical Installer (397 MB)
• MacOS: 64-Bit Graphical Installer (462 MB), 64-Bit Command Line Installer (454 MB)
• Linux: 64-Bit (x86) Installer (550 MB), 64-Bit (Power8 and Power9) Installer
You can download the Anaconda installer for your system, whether it is Windows, Mac or Linux. After downloading it, just run the installer to install Anaconda.
[Screenshot: searching for "Anaconda Prompt (anaconda3)" in the Windows Start menu and opening it]
•/
20
SETTING-UP ENVIRONMENT
---------
This is the Anaconda Command Prompt, from where we can run programs or perform other operations using commands:
(base) C:\Users\Rahul>
Jupyter Notebook
The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. We will use it to perform our data processing, analytics, visualization, etc. on the go. To open Jupyter Notebook, write jupyter notebook in the Anaconda command prompt and press Enter.
(base) C:\Users\Rahul>jupyter notebook
[I NotebookApp] JupyterLab extension loaded from C:\Users\Rahul\anaconda3\lib\site-packages\jupyterlab
[I NotebookApp] JupyterLab application directory is C:\Users\Rahul\anaconda3\share\jupyter\lab
[I NotebookApp] Serving notebooks from local directory: C:\Users\Rahul
[I NotebookApp] The Jupyter Notebook is running at:
[I NotebookApp] http://localhost:8888/?token=bda34ad58a2f2015a03f835d458f95010541adf58866ccf7
[I NotebookApp] or http://127.0.0.1:8888/?token=bda34ad58a2f2015a03f835d458f95010541adf58866ccf7
[I NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C NotebookApp] To access the notebook, open this file in a browser:
    file:///C:/Users/Rahul/AppData/Roaming/jupyter/runtime/nbserver-12332-open.html
    Or copy and paste one of these URLs:
    http://localhost:8888/?token=bda34ad58a2f2015a03f835d458f95010541adf58866ccf7
    or http://127.0.0.1:8888/?token=bda34ad58a2f2015a03f835d458f95010541adf58866ccf7
Anaconda will redirect you to your browser [it may ask which browser to host your Jupyter notebook in, if you have more than one browser], and a new tab will appear with your Jupyter notebook hosted. You can host your Python files here, and also run the code on the fly.
[Screenshot: the Jupyter file browser, listing the folders of the home directory, with New, Upload, Logout and Quit controls]
In Jupyter Notebook, we don't need to install any other module, package or library externally; everything we need is already present. Best of all, you can code without installing any IDE or the Python interpreter separately, which makes it the best choice for data scientists.
Working with Jupyter Notebook
To start coding, click on New and select Python 3 to open a new Python file.
[Screenshot: the New menu in the Jupyter file browser, with the options Python 3 ("Create a new notebook with Python 3"), Text File, Folder and Terminal]
This is the place where we will write our code [in the cell] and run it.
[Screenshot: a new notebook named Untitled, with the menu bar (File, Edit, View, Insert, Cell, Kernel, Widgets, Help), the Run button, and one empty code cell]
If you cannot create a new file or encounter any error, you can head directly to jupyter.org/try and choose Python.
We can rename our file by clicking its name [Untitled].
We have only one code cell; in this cell we will write our code.
There are three types of cells: code cells, markdown cells and raw cells. We can use markdown cells to display headings or titles.
# Jupyter Notebook
Now run the cell by clicking the Run button in the toolbar.
In code cells, we can write Python code and execute it instantly.
In [ ]: 123*525
In [1]: 123*525
Out[1]: 64575
To insert a new cell below the selected cell, press b on your keyboard or click the + icon.
In [1]: 123*525
Out[1]: 64575
In [ ]:
You can select [blue] or edit [green] a cell by clicking outside or inside its text field, respectively.
We have more options in markdown cells, to display text more gracefully. We can add headings, sub-headings and lower-level headings, using # one, two or three times respectively, followed by a space and then the text.
# Jupyter Notebook
## IPython
### Data
We can create ordered and bulleted lists. To create an ordered list, use 1. for the first list item, then use a tab space for the sub-list items and the correct numbering [the text should follow a space after the numbers]:

1. Data Science
    1. Python
    2. Jupyter Notebook
    3. Libraries

This renders as an ordered list, with the sub-items lettered A, B, C. To create a bulleted list, use - for square bullets and * for round bullets, nesting list and sub-list items in the same manner:

- Data Science
    * Python
    * Jupyter Notebook
    * Libraries
We can also add links, using [] and (). Write the display text in [] and put the link in (); you can also add a hover text inside () using " " quotes:

[Jupyter Notebook for Python](https://jupyter.org/try "Try it!")

This renders as the clickable link text "Jupyter Notebook for Python".
We can also wrap text in ** or __ to render bold text, and in * or _ to render italicized text.
We can also insert images by going to Edit > Insert Image and browsing to the image file.
Create tables using | and strictly following the example below:

|Product|Price|Quantity|
|-------|-----|--------|
|Biscuits|5|2|
|Milk|7|5L|

This renders as a table with the columns Product, Price and Quantity, and the rows Biscuits (5, 2) and Milk (7, 5L).
27 V
SETTING-UP ENVIRONMENT z
Here's a complete list of shortcuts for various operations on cells.
|Operation|Shortcut|
|---|---|
|change cell to code|y|
|change cell to markdown|m|
|change cell to raw|r|
|close the pager|Esc|
|restart kernel|0 + 0|
|copy selected cell|c|
|cut selected cell|x|
|delete selected cell|d + d|
|enter edit mode|Enter|
|extend selection below|Shift + j|
|extend selection above|Shift + k|
|find and replace|f|
|ignore|Shift|
|insert cell above|a|
|insert cell below|b|
|interrupt the kernel|i + i|
|merge cells|Shift + m|
|paste cells above|Shift + v|
|paste cells below|v|
|run cell and insert below|Alt + Enter|
|run cell and select below|Shift + Enter|
|run selected cells|Ctrl + Enter|
|save notebook|Ctrl + s|
|scroll notebook up|Shift + Space|
|scroll notebook down|Space|
|select all|Ctrl + a|
|show keyboard shortcuts|h|
|toggle all line numbers|Shift + l|
|toggle cell output|o|
|toggle cell scrolling|Shift + o|
|toggle line numbers|l|
|undo cell deletion|z|
03 PANDAS LIBRARY • Features of Pandas library • Series • Data Frames
Pandas
Data science requires high-performance data manipulation and data analysis, which we can achieve with Pandas data structures. Python with Pandas is in use in a variety of academic and commercial domains, including finance, economics, statistics, advertising, web analytics, and more. Using Pandas, we can accomplish the five typical steps in the processing and analysis of data, regardless of the origin of the data: load, organize, manipulate, model, and analyse it.
Key features of the Pandas library
We can achieve a lot with the Pandas library using features like: • Fast and efficient DataFrame object with default and customized indexing. • Tools for loading data into in-memory data objects from different file formats. • Data alignment and integrated handling of missing data.
• Label-based slicing, indexing and subsetting of large data sets. • Columns from a data structure can be deleted or inserted. • Group by data for aggregation and transformations.
Series
Pandas deals with data via its data structures, known as series, data frames and panels. A series is a one-dimensional array-like structure with homogeneous data. For example, the following series is a collection of integers:
10, 17, 23, 55, 67, 71, 92
As series are homogeneous data structures, a series can contain only one type of data [here, integers]. So, we conclude that a Pandas series: • is a homogeneous data structure • has a size that cannot be mutated • has values that can be mutated
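The series above can be created in a couple of lines; a minimal sketch, assuming Pandas is installed and imported under its conventional alias pd:

```python
import pandas as pd

# Create a homogeneous series of integers
s = pd.Series([10, 17, 23, 55, 67, 71, 92])
print(s.dtype)   # the single data type shared by every value
print(s.size)    # number of elements

# Values are mutable: change the first value in place...
s[0] = 11
# ...but the series keeps the size it was created with
```

Notice that the dtype is a single integer type for the whole series; that is the homogeneity described above.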
Data Frames
A DataFrame is a two-dimensional array with heterogeneous data. For example:

|Day|Sales|
|---|---|
|Monday|33|
|Tuesday|37|
|Wednesday|14|
|Thursday|29|
The data shows the sales of a certain product over 4 days. You can think of a data frame as a container for 2 or more series. So, we conclude that a Pandas data frame: • can contain heterogeneous data • has a mutable size • has mutable data
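The table above can be built as a DataFrame from two series of different types; a short sketch, again assuming the pd alias:

```python
import pandas as pd

# Two columns (series) of different types: strings and integers
sales = pd.DataFrame({
    "Day": ["Monday", "Tuesday", "Wednesday", "Thursday"],
    "Sales": [33, 37, 14, 29],
})
print(sales.shape)  # (4, 2): 4 rows, 2 columns

# Both the size and the data are mutable:
sales.loc[len(sales)] = ["Friday", 41]  # append a new row
sales.loc[2, "Sales"] = 15              # correct a single value
```

After these two mutations the frame has 5 rows and Wednesday's sales read 15, illustrating both kinds of mutability listed above.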
We will use Pandas series and data frames a lot in future lessons, so make sure to go through the lesson again and get a grasp of it.
Key Points • The Pandas library is a high-performance data manipulation and data analysis tool. • Pandas data structures include series and data frames. • A series is a 1-dimensional array of homogeneous data, whose size is immutable but whose values are mutable. • A data frame is a 2-dimensional array of heterogeneous data built from 2 or more series, whose size and data are both mutable.
04 NUMPY PACKAGE • Features of NumPy • ndarray Objects • Lists vs. ndarrays
NumPy
NumPy is a Python package whose name stands for 'Numerical Python'. It is a library consisting of multidimensional array objects and a collection of routines for processing arrays.
[Figure: a 1D array of shape (4,), a 2D array of shape (2, 3) with axis 0 and axis 1, and a 3D array]
Key features of NumPy
NumPy is a powerful package with many features: • Mathematical and logical operations on arrays. • Fourier transforms and routines for shape manipulation. • Operations related to linear algebra: NumPy has built-in functions for linear algebra and random number generation. • NumPy ndarrays are much faster than Python's built-in lists and consume less memory. • Most of the parts that require fast computation are written in C and C++.
ndarray objects
NumPy aims to provide an array object that is up to 50x faster than traditional Python lists. The array object in NumPy is called ndarray; it provides a lot of supporting functions that make working with ndarrays very easy. Arrays are used very frequently in data science, where speed and resources are very important. In NumPy, we can create 0-D, 1-D, 2-D and 3-D ndarrays.
0-D: (33)
1-D: ([11, 27, 18])
2-D: ([[3, 5, 6], [5, 7, 11]])
3-D: ([[[5, 8, 19], [6, 9, 10], [4, 1, 11]]])

In brief, ndarrays or n-dimensional arrays: • describe a collection of items of the same type. • Items in the collection can be accessed using a zero-based index. • Every item in an ndarray takes the same size of block in memory. • Each element in an ndarray is an object of a data-type object (called dtype). Any item extracted from an ndarray object (by slicing) is represented by a Python object of one of the array scalar types.
Lists vs. ndarrays
In Python we have lists that serve the purpose of arrays, but they are slow to process. NumPy aims to provide an array object that is up to 50x faster than traditional Python lists.

Lists: • A list is an array of heterogeneous objects. • List elements are stored in different places in memory, which makes processing data slow. • Lists are not optimized to work with the latest CPUs. • A 1-dimensional list: ['A', 56, 67.05]

ndarrays: • An ndarray is an array of homogeneous objects. • ndarray elements are stored in one continuous place in memory, which makes processing data faster. • ndarrays are optimized to work with the latest CPUs. • A 1-dimensional ndarray: ([12, 17, 25])
[Figure: list arrays are stored as pointers to objects scattered across memory, while an ndarray keeps its data in one contiguous block alongside its dimensions and strides]

You can clearly understand why the built-in list arrays are slower than ndarrays. To accelerate and process data much faster we will use NumPy in future lessons, so make sure to get a hold of it.
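The homogeneity difference can be seen directly in code; a small sketch showing how NumPy coerces the mixed list from above into one shared type:

```python
import numpy as np

mixed_list = ['A', 56, 67.05]   # a list keeps each object's own type
arr = np.array([12, 17, 25])    # an ndarray stores one shared dtype

print([type(x).__name__ for x in mixed_list])  # ['str', 'int', 'float']
print(arr.dtype)                               # a single integer dtype

# Passing the mixed list to NumPy forces everything into one type
coerced = np.array(mixed_list)
print(coerced.dtype.kind)  # 'U': every value became a Unicode string
```

This one-type-per-array rule is exactly what lets NumPy lay the data out contiguously in memory.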
Key Points • NumPy stands for Numerical Python; it is a Python package used for working with arrays. • It also has functions for working in the domains of linear algebra, Fourier transforms, and matrices. • ndarrays or n-dimensional arrays are homogeneous arrays, which are optimized for fast processing. • ndarrays also provide many functions that make them suitable for working with data.
05 SCIPY PACKAGE • Features of SciPy • Data Structures • SciPy Sub-Packages
SciPy
The SciPy library of Python is built to work with NumPy arrays and provides many user-friendly and efficient numerical routines, such as those for numerical integration and optimization. Together, they run on all popular operating systems and are quick to install.

In [1]: #Import packages
        from scipy import integrate
        import numpy as np

        def my_integrator(a, b, c):
            my_fun = lambda x: a*np.exp(b*x) + c
            y, err = integrate.quad(my_fun, 0, 100)
            print('ans: %1.4e, error: %1.4e' % (y, err))
            return (y, err)

        #Call function
        my_integrator(5, -10, 3)

        ans: 3.0050e+02, error: 4.5750e-10
Out[1]: (300.5, 4.574965520082099e-10)
Key features of SciPy
SciPy combined with NumPy results in a powerful tool for data processing, with features like: • The SciPy package contains various toolboxes dedicated to common issues in scientific computing. Its different submodules correspond to different applications, such as interpolation, integration, optimization, image processing, statistics, special functions, etc. • SciPy is the core package for scientific routines in Python; it is meant to operate efficiently on NumPy arrays, so that NumPy and SciPy work hand in hand. • SciPy is organized into sub-packages covering different scientific computing domains, which makes it more efficient.
Data structures
The basic data structure used by SciPy is a multidimensional array provided by the NumPy module. NumPy provides some functions for linear algebra, Fourier transforms and random number generation, but not with the generality of the equivalent functions in SciPy. Beyond these, SciPy offers physical and mathematical constants, Fourier transforms, interpolation, data input and output, sparse matrices, etc.
[Figure: a dense matrix, where most entries are non-zero, next to a sparse matrix, where most entries are zero, illustrating the use of sparse matrices]
SciPy sub-packages
As we already know, SciPy is organized into sub-packages covering different scientific computing domains, so we can import them according to our needs rather than importing the whole library. The following table lists the sub-packages of SciPy:
|Sub-package|Purpose|
|---|---|
|scipy.constants|Mathematical constants|
|scipy.fftpack|Fourier transform|
|scipy.integrate|Integration routines|
|scipy.interpolate|Interpolation|
|scipy.io|Data input and output|
|scipy.linalg|Linear algebra routines|
|scipy.optimize|Optimization|
|scipy.signal|Signal processing|
|scipy.sparse|Sparse matrices|
|scipy.spatial|Spatial data structures|
|scipy.special|Special mathematics|
|scipy.stats|Statistics|
Key Points • The SciPy package is a toolbox for common scientific computing problems. • SciPy together with NumPy creates a powerful tool for data processing. • Along with NumPy's functions, SciPy provides a lot of functions to perform different tasks on ndarrays. • SciPy is divided into sub-packages dedicated to different tasks.
06 MATPLOTLIB LIBRARY • Features of Matplotlib • Data Visualization • PyPlot in Matplotlib
Matplotlib
Matplotlib is a Python library used to create 2D graphs and plots from Python scripts. It has a module named pyplot which makes plotting easy by providing features to control line styles, font properties, axis formatting, etc.
[Figure: a bar chart of values by day (Thur, Fri, Sat, Sun) and a scatter plot, both drawn with Matplotlib]
Key features of Matplotlib
Matplotlib is the best choice for data visualization because of features like: • It supports a very wide variety of graphs and plots, namely histograms, bar charts, power spectra, error charts, and many more. • It is used along with NumPy to provide an environment that is an effective open-source alternative to MATLAB. • Using its pyplot module, plotting simple graphs or any other charts is very easy.
Data Visualization
Data visualization is the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data.

In the world of Big Data, data visualization tools and technologies are essential to analyze massive amounts of information and make data-driven decisions. Data visualization helps us view data in a graphical, more interesting way, rather than as a big chunk of numbers in a uniform line. We will process, analyze and then visualize our data; if we don't visualize it, the data loses much of the impact it would have as bar graphs, pie charts, etc.
PyPlot in Matplotlib
matplotlib.pyplot is a collection of functions that make Matplotlib work like MATLAB. Each pyplot function makes some change to a figure: e.g., creates a figure, creates a plotting area in a figure, plots some lines in a plotting area, decorates the plot with labels, etc. To test it yourself, jump to Jupyter Notebook and start off by importing the matplotlib.pyplot module. In [ ]:
import matplotlib.pyplot as mplt
To plot a simple graph, use the plot function and pass a list, and then use the show function to view the graph:

In [1]: import matplotlib.pyplot as mplt
        mplt.plot([1,3,6,9])
        mplt.show()

We have successfully plotted our graph with some random values in a list. If we want, we can name the x and y axes using xlabel and ylabel respectively.
In [2]: import matplotlib.pyplot as mplt
        mplt.plot([1,3,6,9])
        mplt.xlabel('X_Axis')
        mplt.ylabel('Y_Axis')
        mplt.show()
The graph has a solid blue line; we can change its color and line style by passing another argument to the plot function, like 'ro' for 'r' (red) and 'o' (circle markers).

In [2]: import matplotlib.pyplot as mplt
        mplt.plot([1,3,6,9],'ro')
        mplt.xlabel('X_Axis')
        mplt.ylabel('Y_Axis')
        mplt.show()
Scikit-learn or Sklearn
Scikit-learn (Sklearn) is the most useful and robust library for machine learning in Python. It provides a selection of efficient tools for machine learning and statistical modeling, including classification, regression, clustering, and dimensionality reduction, via a consistent interface in Python. The library, which is largely written in Python, is built upon NumPy, SciPy, and Matplotlib.
Features of Scikit-learn
Scikit-learn focuses on modelling data. The following are the most popular groups of tools provided by the library:
• Supervised learning algorithms, like Linear Regression, Support Vector Machine (SVM), Decision Tree, etc., are part of scikit-learn.
• Unsupervised learning algorithms, from clustering, factor analysis, and PCA (Principal Component Analysis) to unsupervised neural networks.
SCIKIT LEARN LIBRARY
• Cross-validation, dimensionality reduction, ensemble methods, feature extraction, and feature selection are also features of scikit-learn. They are used, respectively, to check the accuracy of supervised models, reduce the number of attributes in a data set, combine the predictions of multiple supervised models, extract features, and identify useful features in a data set.
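As a small, hedged sketch of how a couple of these pieces fit together (assuming scikit-learn is installed; the dataset below is randomly generated, not from the book):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# A made-up dataset: 100 samples, 4 features, 2 classes
X, y = make_classification(n_samples=100, n_features=4, random_state=0)

# Cross-validation: check a supervised model's accuracy across 5 folds
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(scores.mean())  # average accuracy across the folds
```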
08 TYPES OF MACHINE LEARNING
• Supervised learning
• Unsupervised learning
• Deep learning
• Reinforcement learning
• Deep reinforcement learning
In the previous lessons we learned about the various libraries and packages used in the process of machine learning. Now let's look at the types of machine learning.
Types of machine learning
The following are the different types of machine learning: supervised learning, unsupervised learning, deep learning, reinforcement learning, and deep reinforcement learning.
Supervised learning
As its name suggests, in supervised learning we train a machine learning model with supervision. We feed the data, train the model, and tell the model to make predictions. If the predictions are correct, we leave the model to work on its own from there; otherwise we help the model predict correctly until it learns to. It is the same as teaching a child to solve questions at first until he can solve them on his own.
Types of supervised learning
Regression and classification are the two types of supervised machine learning. They can be understood as:
• Regression is the type of machine learning in which we feed the model with data like 'A' (input, i.e. X) has a value of 65 (output, i.e. Y), 'B' has a value of 66, etc. Based on the given data, the model learns the relation between the input and output (here 'A' & 65). Once the machine is trained with sufficient data, we provide an input, let's say 'C', and let the model predict the output, for which you must know the real output. You compare the prediction with the real value to check whether it is correct or wrong. If the predictions are correct, we pass the model. If the predictions aren't, we keep training the model until it predicts correctly.
Regression in turn has different types, like linear and logistic regression, which we will learn about in a separate lesson.
• Classification is the type of machine learning in which we feed data and the model classifies the data into different groups. Consider the following example:
the data has different types of shapes in it. We will teach the model which shape is which, or what the different groups in the data are. We will provide the groups with their features like:
[Figure: the shapes grouped as circles, squares, ovals, and rounded squares]
Now the trained model can classify any data after learning how the groups are formed. If a new shape is passed, it will classify it according to what it has learned. Like regression, we keep feeding it data until it classifies the data correctly.
Classification also has different types, like decision trees, Naive Bayes classification, support vector machines, etc. We will learn about them in the lesson dedicated to this topic.
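The shape-sorting idea above can be sketched with scikit-learn, assuming we encode each shape numerically ourselves (the features and values below are made up for illustration):

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical numeric encoding of shapes: [number of corners, width-to-height ratio]
X = [[0, 1.0], [4, 1.0], [0, 1.6], [4, 1.5]]
y = ["circle", "square", "oval", "rounded square"]

model = DecisionTreeClassifier()
model.fit(X, y)  # teach the model which features belong to which group

# Classify a new shape: 4 corners, equal width and height
prediction = model.predict([[4, 1.0]])
print(prediction[0])
```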
Unsupervised learning
Unlike supervised learning, we don't teach or check the predictions made by the models; instead we feed the data and ask for predictions directly. And it is obvious that the more data you feed, the more accurate the results will be. Unsupervised learning is used in artificial intelligence applications like face detection, object detection, etc.
Deep learning
Deep learning models are based on Artificial Neural Networks (ANNs), or more specifically Convolutional Neural Networks (CNNs). There are several architectures used in deep learning, such as deep neural networks, deep belief networks, recurrent neural networks, and convolutional neural networks. These models are used to solve problems like:
• computer vision, image classification, etc.
• bioinformatics
• drug design
• games, etc.
Deep learning is such a large subject in itself that it is often treated as a completely different type of machine learning, so we will not discuss it in much detail; let's just see what it is and how it solves problems. Deep learning requires a lot of data and computational power, but nowadays high-performance
computing is available to us. Let's consider an example where a deep learning model tells us whether an animal is a horse or not. The network takes a large number of horse photos as data, analyzes them, and tries to extract patterns from them like horns, color, saddle, eyes, etc.
The neural network comes to a conclusion about whether the animal is a horse or not, but how it reached that conclusion is unknown. Since the reasoning cannot be obtained from deep learning models, they are also considered black boxes.
Reinforcement learning
Reinforcement learning consists of learning models which don't require any input or output data; instead they learn to perform certain tasks, like solving puzzles or playing games. If the model performs steps correctly, it is rewarded points; otherwise it is penalized. The model learns from its mistakes the more it performs. The model creates data as it performs its functions, unlike receiving data at the beginning. For example, consider the following model:
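A minimal sketch of this reward-and-penalty loop is tabular Q-learning on a made-up 5-cell corridor where the agent must reach the rightmost cell (all names, rewards, and constants below are illustrative, not from the book):

```python
import random

random.seed(0)
n_states = 5              # cells 0..4 of a made-up corridor; cell 4 is the goal
actions = [-1, +1]        # step left or step right
q = {(s, a): 0.0 for s in range(n_states) for a in actions}
alpha, gamma, epsilon = 0.5, 0.9, 0.2

for episode in range(200):
    state = 0
    while state != 4:
        # explore sometimes, otherwise take the best-known action
        if random.random() < epsilon:
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: q[(state, a)])
        nxt = min(max(state + action, 0), n_states - 1)
        reward = 1.0 if nxt == 4 else -0.1   # reward reaching the goal, penalize wandering
        best_next = max(q[(nxt, a)] for a in actions)
        q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
        state = nxt

# After enough episodes, the learned greedy policy steps right in every cell
policy = [max(actions, key=lambda a: q[(s, a)]) for s in range(n_states - 1)]
print(policy)
```

Note how the model generates its own experience while acting, rather than being handed a labelled dataset up front.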
Deep reinforcement learning
As the name suggests, this is the combination of reinforcement and deep learning. Reinforcement learning algorithms are combined with deep learning to create powerful deep reinforcement learning models that are used in fields like robotics, video games, finance, and healthcare. Many previously unsolvable problems have been solved by these models. DRL models are still new to us and there is a lot to learn about them.
09 MATHEMATICS FOR MACHINE LEARNING
• Data instances
• Statistics
• Probability
• Bayes Theorem
Although every mathematical calculation will be performed by the computer, you need to know the important formulas and mathematical notations even if you're not solving them yourselves. In this chapter we will go through the important concepts in mathematics required for machine learning.
Data instances
Data is what we need to perform all the functions, i.e. data is the base of everything. You need to know what type of data is required in which process. Let's consider the following as our data:
Day       | Sales
Monday    | 33
Tuesday   | 37
Wednesday | 14
Thursday  | 29
In the above table there are two columns, Day & Sales, and four rows. In the data we have two kinds of things: features of the data (numeric values like 33) and labels of the data (descriptive values like Monday). Here Day is the label and Sales is the feature.
The labels also have the following two types:
• Nominal: these data aren't ordered. They have no hierarchy or upper or lower status. In the following data the labels have either a True or False value, i.e. they are considered nominal data.
Question | Answer
01       | True
02       | False
03       | False
04       | True
• Ordinal: these data are ordered. They have an upper or lower status. In the following data the labels have an order in their values, like Good > Average > Bad, i.e. ordinal data.
Product ID | Rating
101        | Average
102        | Good
103        | Average
104        | Bad
Similarly, the features also have two different types. The following are the two:
• Discrete, or finite values. These values have a limit; for example, in the following data the feature, number of children (NoC) in different families, is limited to 1, 2, or 3, i.e. it is called discrete data.
Family   | NoC
Smith's  | 2
Martin's | 1
Cox's    | 3
Hyde's   | 2
• Continuous, or infinite values. These values don't have a limit; for example, in the following data the feature, weight of different people, isn't finite. It could be 110 pounds, 110.20 pounds, or even 110.21 pounds, i.e. continuous data.
Person | Weight
Thon   | 110
Max    | 122
Mary   | 96
Alex   | 120
Data Collection
Data is collected from many sources. Let's say we want data for a whole country. We could survey the whole population, which is time-consuming, or we can select a sample of the population and survey its data. The sample can be selected randomly, on the basis of characteristics, or by other features. Likewise, instead of feeding a whole data set to a model, we can obtain a sample from it to save time and get better results.
Statistics
Statistics is often thought of as data visualization like bar graphs, etc., but statistics also includes data collection, data analysis, and data representation. As you may have learnt in school, we perform statistical analysis of data, like finding the central tendency and visualizing the data in graphs. Descriptive and inferential statistics are used in machine learning.
Descriptive statistics
In descriptive statistics we work with the whole data, i.e. the population, rather than a sample. In descriptive statistics we have the following:
• Central Tendency
The mean, median, and mode of a data set are referred to as its central tendency. We can find each of them very easily; let's consider the following data and find its central tendency:
Day       | Sales
Monday    | 33
Tuesday   | 37
Wednesday | 14
Thursday  | 29
To find the mean, or the average, of the sales, we add all the values together and divide by the total number of values:

mean (x̄) = (sum of all values) / (total no. of values) = (33 + 37 + 14 + 29) / 4 = 28.25

The average or mean sales is 28.25. As mentioned earlier, you don't need to perform the calculations; they will be done by the computer. There are even
separate functions in the pandas library, like the mean() method of a pandas Series, to find it; you just need to know what it is and how the value is obtained, so you understand what happens in each analysis. To find the median, or the middle value, sort the numbers; if the data has an odd number of values, the middle value is the median, like
12 34 (56) 71 77

56 is the median of the above data. But if the number of values is even, like our sales data, we find the sum of the middle pair and divide it by 2:
median (x̃) = (sum of middle pair) / 2 = (29 + 33) / 2 = 31
And at last we have the mode, or the most frequently occurring value. It can be spotted visually, but our sales data doesn't have any repeating values, so we will consider the following example:
(12) 34 (12) 71 77 (12) 56 78
12 is the mode of the above data, as it occurred three times; there can be many repeated values in a data set.
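These three measures can be checked with Python's built-in statistics module (the first list is the sales data, the second is the mode example above):

```python
import statistics

sales = [33, 37, 14, 29]
print(statistics.mean(sales))    # (33 + 37 + 14 + 29) / 4 = 28.25
print(statistics.median(sales))  # even count: middle pair (29 + 33) / 2 = 31.0
print(statistics.mode([12, 34, 12, 71, 77, 12, 56, 78]))  # 12 occurs three times
```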
• Variability or Spread
Range, interquartile range, variance, and standard deviation are referred to as variability, or measures of spread.
Range is the difference between the maximum and minimum value in a data set. The range of the sales data is:

range = max - min = 37 - 14 = 23
Interquartile Range is similar to the range, but a bit different. Let's consider the following data:
12 27 33 35 35 42 45 47 51 53 54

We divide the data into quarters, with the quartile values as separators:

(12 27) 33 (35 35) 42 (45 47) 51 (53 54)

Then we subtract the first separator from the third separator to get the interquartile range:

Interquartile range = 3rd separator - 1st separator = 51 - 33 = 18
Variance, or the difference of random variables from the expected value, can be obtained with the following formula, where xᵢ is an individual data point, x̄ is the mean, and n is the total number of data values:

s² = Σᵢ (xᵢ - x̄)² / n
If you want, you can find the variance of any data set by replacing the values in the formula, but make sure to remember how the variance is found. Next, if we want to find the deviation, i.e. the difference of each value from its average or mean, we can use the following formula, where i indexes the values in the data and μ represents the mean:

Deviation = (xᵢ - μ)

If we want to find the squared deviation for a population, we use the mean of the whole population, which is represented by μ:

σ² = (xᵢ - μ)²

But if we want to find the squared deviation for a sample from a population, we use the mean of the sample instead, which is represented by x̄; this is called inferential statistics:

s² = (xᵢ - x̄)²
Similarly, we can find the standard deviation, or the dispersion of data from its mean, through the following formula, where N is the number of data points or values:

σ = √( (1/N) Σᵢ₌₁ᴺ (xᵢ - μ)² )
Let's consider an example to understand how to find the standard deviation. We will find the standard deviation of our sales data.
First we need to find the mean, μ, i.e. 28.25, which we found earlier.
Then we find the difference of each sales value from the mean, square them, and add them:

(33 - 28.25)² + (37 - 28.25)² + (14 - 28.25)² + (29 - 28.25)² = 302.75
Finally we take the square root of 1/N times that sum to find σ:

σ = √( (1/4) × 302.75 )   [N = 4]
σ = √75.69
σ ≈ 8.70

Therefore, the standard deviation of the sales data is approximately 8.70, and as mentioned earlier, don't sweat the calculation, just understand the application of the formula.
Entropy and Information Gain
Entropy, or the uncertainty in a data set, can be found with the following formula:

H(S) = -Σᵢ₌₁ᴺ pᵢ log₂ pᵢ
where S stands for the set of all instances in a data set, N refers to the number of distinct values, and pᵢ stands for the probability of the event. Through entropy we can further calculate the information gain from a variable through the following formula:

Gain(A, S) = H(S) - Σⱼ₌₁ᵛ (|Sⱼ| / |S|) × H(Sⱼ)

where A is the feature or variable whose information gain is being calculated, H(S) is the entropy of the whole dataset, |Sⱼ| is the number of instances with value j of feature A, |S| is the number of all data instances in the dataset, v is the set of distinct values of feature A, and H(Sⱼ) is the entropy of the subset of data instances with value j of feature A. Let's consider an example to understand entropy and information gain more clearly. We have the following dataset:
Day | Discount | Advertisement | Sales
1   | 10%      | No            | Average
2   | 25%      | No            | Maximum
3   | 20%      | Yes           | Maximum
4   | 10%      | Yes           | Maximum
5   | 25%      | No            | Average
6   | 10%      | Yes           | Maximum
7   | 20%      | No            | Maximum
8   | 20%      | No            | Maximum
9   | 10%      | Yes           | Average
10  | 20%      | Yes           | Maximum
You are told to find the best feature, Discount or Advertisement, for making Sales Maximum. So which feature will you choose to build a model that predicts the best values for maximum sales? We will find the information gain from each feature to figure that out. Let's start with Discount; we have the following details about it:
[Figure: Total values 10 (Max 7, Avg. 3). Discount = 10%: Max 2, Avg. 2; Discount = 25%: Max 1, Avg. 1; Discount = 20%: Max 4, Avg. 0]
Now we can find the entropy of the whole dataset using the entropy formula. The total number of values N is 10, and the probabilities of Avg. sales and Max sales are 3/10 and 7/10 respectively:

H(S) = -(7/10) log₂(7/10) - (3/10) log₂(3/10)
H(S) ≈ 0.88
After obtaining the entropy we can substitute the values into the information gain formula to find the information gain of the Discount feature:

Discount | Sales
10%      | Average
25%      | Maximum
20%      | Maximum
10%      | Maximum
25%      | Average
10%      | Maximum
20%      | Maximum
20%      | Maximum
10%      | Average
20%      | Maximum
Gain(Discount, S) = H(S) - (4/10 × 1) - (2/10 × 1) - (4/10 × 0)
                  = 0.88 - 0.4 - 0.2 - 0.0
                  = 0.88 - 0.6 = 0.28

The information gain from the Discount feature is 0.28. The feature with the highest information gain is the one used in models to predict values, to get better results.
Similarly, we can find the information gain of the Advertisement feature:

Advertisement | Sales
No            | Average
No            | Maximum
Yes           | Maximum
Yes           | Maximum
No            | Average
Yes           | Maximum
No            | Maximum
No            | Maximum
Yes           | Average
Yes           | Maximum

Gain(Advertisement, S) = H(S) - (5/10 × 0.72) - (5/10 × 0.97)
                       = 0.88 - 0.36 - 0.49
                       = 0.88 - 0.85 = 0.03
It is clear that we should use the Discount feature rather than the Advertisement feature because of its higher information gain. And again, as mentioned earlier, just understand what's going on; this is one of the important techniques for data scientists.
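The entropy and information-gain arithmetic above can be reproduced with a short standard-library script (the three lists are the columns of the dataset):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = -sum(p_i * log2(p_i)) over the distinct label probabilities."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(feature, labels):
    """Gain(A, S) = H(S) - sum(|Sj|/|S| * H(Sj)) over values j of feature A."""
    total = len(labels)
    remainder = 0.0
    for value in set(feature):
        subset = [lab for f, lab in zip(feature, labels) if f == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

sales = ["Avg", "Max", "Max", "Max", "Avg", "Max", "Max", "Max", "Avg", "Max"]
discount = ["10%", "25%", "20%", "10%", "25%", "10%", "20%", "20%", "10%", "20%"]
advertisement = ["No", "No", "Yes", "Yes", "No", "Yes", "No", "No", "Yes", "Yes"]

print(round(entropy(sales), 2))                         # ≈ 0.88
print(round(information_gain(discount, sales), 2))      # ≈ 0.28
print(round(information_gain(advertisement, sales), 2)) # ≈ 0.03
```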
Confusion matrix
A confusion matrix is used to calculate the accuracy of a model. To do that we need to create the confusion matrix first. Let's say we created a model that predicts weather, and we asked it to predict whether it will rain for each of the next 30 days. After 30 days, we matched the predicted values with the actual values and found that the model predicted 25 days correctly and 5 incorrectly:

               | Predicted no rain | Predicted rain
Actual no rain | 8                 | 3
Actual rain    | 2                 | 17

Here 8 is referred to as true negatives (T-), 3 as false positives (F+), 2 as false negatives (F-), and 17 as true positives (T+). We can use these values to calculate the accuracy of the model using the following
formula:

accuracy = ((T+) + (T-)) / ((T+) + (T-) + (F+) + (F-))

So the accuracy of our model is:

accuracy = (17 + 8) / (17 + 8 + 3 + 2) = 25 / 30 ≈ 0.83
Similarly, we can calculate the error rate, or misclassification rate, using the following formula:

Error rate = ((F+) + (F-)) / ((T+) + (T-) + (F+) + (F-)) = (3 + 2) / (17 + 8 + 3 + 2) = 5 / 30 ≈ 0.17

It is the same as 1 - accuracy. Next, we can calculate the precision of the model using the formula below:
Precision = (T+) / ((T+) + (F+)) = 17 / (17 + 3) = 17 / 20 = 0.85
Precision can also be defined as the fraction of the model's positive predictions that are actually correct.
We can also calculate the recall using the formula below:

Recall = (T+) / ((T+) + (F-)) = 17 / (17 + 2) = 17 / 19 ≈ 0.89

And finally, for models with high precision and low recall or vice versa, we can calculate the F-measure using the formula below:

F-measure = (2 × Recall × Precision) / (Recall + Precision)
          = (2 × 0.89 × 0.85) / (0.89 + 0.85)
          = 1.51 / 1.74 ≈ 0.87
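All five metrics follow directly from the four counts, so they are easy to check in plain Python:

```python
# Values from the rain-prediction confusion matrix: 17 true positives,
# 8 true negatives, 3 false positives, 2 false negatives
tp, tn, fp, fn = 17, 8, 3, 2

accuracy  = (tp + tn) / (tp + tn + fp + fn)
error     = (fp + fn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f_measure = 2 * recall * precision / (recall + precision)

print(round(accuracy, 2), round(error, 2))    # 0.83 0.17
print(round(precision, 2), round(recall, 2))  # 0.85 0.89
print(round(f_measure, 2))                    # 0.87
```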
Probability
It is the simplest mathematical way to predict the outcome of an event, using the following formula:

Probability of event A = Favourable outcomes / Total outcomes
There are three concepts required for machine learning that involve both statistics and probability: the probability density function, the normal distribution, and the central limit theorem.
Probability Density Function
A probability density function satisfies the following conditions:
• It is continuous over the range.
• The area under the curve and the x-axis is equal to 1.
• The probability of events lies between a and b.
Any variable that satisfies these conditions is called a continuous random variable.
Normal Distribution
Variables (features) with mean 0 and variance 1 are called standard normal random variables. A normal distribution is an arrangement of a data set in which most values cluster in the middle of the range and the rest taper off symmetrically toward either extreme. In a normal distribution the mean, median, and mode are the same. We can represent the distribution as the graph below:
where μ is the mean and σ is the standard deviation. The formula of the normal distribution can be represented as:

y (Normal Variable) = [1 / (σ√(2π))] × e^(-(x - μ)² / (2σ²)),  where e ≈ 2.718
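The formula translates directly into Python (a small sketch using only the math module):

```python
from math import sqrt, pi, e

def normal_pdf(x, u, sigma):
    """y = (1 / (sigma * sqrt(2*pi))) * e**(-(x - u)**2 / (2 * sigma**2))"""
    return (1 / (sigma * sqrt(2 * pi))) * e ** (-((x - u) ** 2) / (2 * sigma ** 2))

# At the mean of a standard normal (u = 0, sigma = 1) the curve peaks at about 0.3989
print(round(normal_pdf(0, 0, 1), 4))
```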
Central Limit Theorem
The theorem states that the mean of the means of sufficiently large samples drawn from a population will be approximately equal to the population mean μ.
[Figure: Population mean ≈ Sample 1 mean ≈ Sample 2 mean ≈ Sample 3 mean]
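A small simulation sketch of this idea, using only the standard library (the population here is randomly generated, not real data):

```python
import random
import statistics

random.seed(1)
# A made-up population of 10,000 values with mean around 50
population = [random.gauss(50, 10) for _ in range(10000)]

# Draw many random samples and collect each sample's mean
sample_means = [statistics.mean(random.sample(population, 100)) for _ in range(200)]

print(round(statistics.mean(population), 1))
print(round(statistics.mean(sample_means), 1))
```

The two printed numbers come out very close, which is the theorem in action.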
Types of Probability
Probability can be classified into the following three types:
• Marginal probability: the probability of an event without conditions, like drawing a number from the first ten natural numbers.
• Joint probability: the probability of two events at once, like drawing a red card with the number 4 from a deck of cards.
• Conditional probability: the probability of one or more events with conditions. The condition may be fulfilled already, or be required at the moment of the event. For example, drawing a joker card from your friend's hand, where it may or may not be present.
Bayes Theorem
Bayes' theorem is a way of finding the probability of an event when we know the probability of other events or conditions. The formula is given as:
P(A|B) = P(B|A) P(A)/P(B) Let's say P(Fire) means how often there is fire, and P(Smoke) means how often we see smoke, then: • P(Fire|Smoke) means how often there is fire when we can see smoke • P(Smoke|Fire) means how often we can see smoke when there is fire
Say we know the following probabilities:
• dangerous fire: P(Fire) = 1%
• smoke: P(Smoke) = 10%
• smoke when there is a dangerous fire: P(Smoke|Fire) = 90%
Then the probability of a dangerous fire when there is smoke is:

P(Fire|Smoke) = 1% × 90% / 10% = 9%

Using Bayes' theorem we can find many probabilities of events. There is even a Naive Bayes model in machine learning, which we will learn about in the upcoming lessons.
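The fire-and-smoke calculation is a one-liner in Python:

```python
p_fire = 0.01              # P(Fire): dangerous fire
p_smoke = 0.10             # P(Smoke): smoke is seen
p_smoke_given_fire = 0.90  # P(Smoke|Fire): smoke when there is a dangerous fire

# Bayes' theorem: P(Fire|Smoke) = P(Smoke|Fire) * P(Fire) / P(Smoke)
p_fire_given_smoke = p_smoke_given_fire * p_fire / p_smoke
print(round(p_fire_given_smoke, 2))  # 0.09
```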
10 SCIKIT LEARN ALGORITHMS
• Regression algorithm
• Classification algorithm
• Clustering algorithm
Algorithms
We already learned about the different types of machine learning, like supervised learning, etc. But how will you choose the best algorithm for your problem? For that purpose, let's understand the algorithms we are going to use, so we can decide which one is suitable for our problem.
Regression Algorithm
Scikit-learn's algorithms generally need more than 50 data points or values to work well. The following visual will help you understand the working of scikit-learn's regression model, which is used to predict quantities:
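A minimal sketch of scikit-learn's regression interface, using made-up numbers in the spirit of the A→65, B→66 example from the supervised-learning lesson (assuming scikit-learn is installed):

```python
from sklearn.linear_model import LinearRegression

# Made-up training data: inputs encoded as numbers, with known outputs
X = [[1], [2], [3], [4]]
Y = [65, 66, 67, 68]

model = LinearRegression()
model.fit(X, Y)              # learn the relation between input and output
print(model.predict([[5]]))  # predict the output for an unseen input
```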
dt.loc[:,['name','sales']]

   name             sales
0  biscuits           227
1  cookies            158
2  cake                50
3  whey_supplement     24
4  protein_bars        85
5  potato_chips       121
or with just some rows:
In [46]:
import pandas as pan
dt = pan.read_csv("csv_data.csv")
dt.loc[4:6,['name','sales']]
Out[46]:
   name          sales
4  protein_bars     85
5  potato_chips    121
To access a single element, we can use its row-column index with the values attribute.
In [49]:
import pandas as pan
dt = pan.read_csv("csv_data.csv")
dt
Out[49]:
   id   name             price  sales  brand
0  101  biscuits          5.00    227  HomeFoods
1  102  cookies           7.25    158  TBakery
2  103  cake             12.00     50  TBakery
3  104  whey_supplement  34.90     24  MusleUp
4  105  protein_bars      4.90     85  MusleUp
5  106  potato_chips      1.75    121  HomeFoods

In [50]:
biscuits_sales = dt.values[0,3]
biscuits_sales
Out[50]:
227
[Figure: dt.values lays the DataFrame out as a 2-D array; dt.values[0,3] selects row 0, column 3, i.e. 227]
The data values are stored as ndarrays, so to access single elements we can use slicing similar to that of DataFrames.
IMPORTING DATA
Importing JSON Data
A JSON file stores data as text in human-readable format. JSON stands for JavaScript Object Notation. Get your sample JSON data here:
json_data (check the Resources)
Pandas can read JSON files using the read_json function.
In [2]:
import pandas as pan dt = pan.read_json("json_data.json") dt
   ID   Name             Price  Sales  Brand
0  101  Biscuits          5.00    227  HomeFoods
1  102  Cookies           7.25    158  TBakery
2  103  Cake             12.00     52  TBakery
3  104  Whey Supplement  34.90     24  MusleUp
4  105  Protein Bars      4.90     85  MusleUp
5  106  Potato Chips      1.75    121  HomeFoods
Similar to CSV files, we can perform all the slicing and data extraction with JSON data files.
In [6]:
import pandas as pan dt = pan.read_json("json_data.json") print(dt.loc[:,["ID","Name","Sales"]]) print(dt["Name"]) print(dt.values[5,4])
    ID  Name             Sales
0  101  Biscuits           227
1  102  Cookies            158
2  103  Cake                52
3  104  Whey Supplement     24
4  105  Protein Bars        85
5  106  Potato Chips       121
0           Biscuits
1            Cookies
2               Cake
3    Whey Supplement
4       Protein Bars
5       Potato Chips
Name: Name, dtype: object
HomeFoods
Importing EXCEL Data
Microsoft Excel is a very widely used spreadsheet program. Its user-friendliness and appealing features make it a very frequently used tool in data science. Get your sample Excel data here:
xlsx_data (check the Resources)
The read_excel function of the pandas library is used to read the content of an Excel file into the Python environment as a pandas DataFrame.
In [9]:
import pandas as pan dt = pan.read_excel("xlsx_data.xlsx") dt
   id   name             price  sales  brand      Unnamed: 5
0  101  biscuits          5.00    227  HomeFoods  NaN
1  102  cookies           7.25    158  TBakery    NaN
2  103  cake             12.00     50  34         TBakery
3  104  whey_supplement  34.90     24  MusleUp    NaN
4  105  protein_bars      4.90     85  MusleUp    NaN
5  106  potato_chips      1.75    121  HomeFoods  NaN
As Excel sheets are imported as pandas DataFrames, we can perform all the DataFrame tasks on the Excel data. You may notice we have an Unnamed: 5 column with NaN values (except dt.values[2,5]). Let's clean up our data. First we need to remove the Unnamed: 5 column, which we can do using the del keyword.
In [10]:
import pandas as pan dt = pan.read_excel("xlsx_data.xlsx") del dt["Unnamed: 5"]
As we learned earlier, the del keyword removes the whole column, so we don't need any further data cleansing for it.
We have removed the Unnamed: 5 column.
In [11]:
import pandas as pan dt = pan.read_excel("xlsx_data.xlsx") del dt["Unnamed: 5"] dt
   id   name             price  sales  brand
0  101  biscuits          5.00    227  HomeFoods
1  102  cookies           7.25    158  TBakery
2  103  cake             12.00     50  34
3  104  whey_supplement  34.90     24  MusleUp
4  105  protein_bars      4.90     85  MusleUp
5  106  potato_chips      1.75    121  HomeFoods
Now we need to replace dt.values[2,4], i.e. 34, with TBakery. We can use the replace method.
In [12]:
import pandas as pan dt = pan.read_excel("xlsx_data.xlsx") del dt["Unnamed: 5"] dt.replace({34:"TBakery"})
   id   name             price  sales  brand
0  101  biscuits          5.00    227  HomeFoods
1  102  cookies           7.25    158  TBakery
2  103  cake             12.00     50  TBakery
3  104  whey_supplement  34.90     24  MusleUp
4  105  protein_bars      4.90     85  MusleUp
5  106  potato_chips      1.75    121  HomeFoods
So our data is clean, with no errors. Try recapping the chapter and attempt the exercise, where you'll be provided with sample data files (links) containing lots of errors, and you have to perform all the data cleansing practised in the previous lesson. This will be a very good exercise to help you understand data processing and cleansing better.
12 DATA OPERATIONS
• NumPy operations
• Pandas operations
• Cleaning data
Python handles data of various formats mainly through two libraries, Pandas and NumPy. We have already seen the important features of these two libraries in the previous chapters. In this chapter we will see some basic examples from each library on how to operate on data and perform different tasks like cleaning the data, analytics, etc.
NumPy Operations To start working with NumPy, we need to import numpy to create NumPy arrays. In [ ]:
import numpy
Now let's create an array using the array() function and print it.
In [2]:
import numpy
ar = numpy.array([1,5,7])
print(ar)

[1 5 7]
ar is a 1-Dimensional array; we can also create a 2-Dimensional array by putting one or more 1-Dimensional arrays inside another array.
In [8]:
import numpy
ar = numpy.array([[1,5,7],
                  [2,3,9]])
print(ar)

[[1 5 7]
 [2 3 9]]
We can specify the dimension of an array during creation using the ndmin parameter In [10]:
import numpy
ar = numpy.array([1,5,7], ndmin = 2 )
print(ar)

[[1 5 7]]
Although we passed a 1-Dimensional array, it became a 2-Dimensional array because of the specification of the dimensions in the ndmin parameter. We created an array with integers, so let's create arrays with strings and floats using the dtype parameter with the same values.
In [11]:
import numpy
ar_str = numpy.array([1,5,7], dtype = str )
ar_flt = numpy.array([1,5,7], dtype = float)
print(ar_str)
print(ar_flt)

['1' '5' '7']
[1. 5. 7.]
ar_str is an array of string literals and ar_flt is an array of floats. We can also change these numbers to complex numbers the same way, using complex as the dtype.
In [13]:
import numpy
ar_str = numpy.array([1,5,7], dtype = str )
ar_flt = numpy.array([1,5,7], dtype = float)
ar_cmx = numpy.array([1,5,7], dtype = complex )
print(ar_str)
print(ar_flt)
print(ar_cmx)

['1' '5' '7']
[1. 5. 7.]
[1.+0.j 5.+0.j 7.+0.j]
ar_str, ar_flt, and ar_cmx are arrays created with the same data but with different data types: strings, floats, and complex numbers respectively.
Pandas Operations
Pandas handles data through Series, DataFrames, and Panels. We will learn to create each of these.
Pandas Series
We already know what a pandas Series is. A pandas Series can be created using the Series() function, so let's import pandas and create one.
In [14]:
import pandas
sr = pandas.Series([1,5,7])
print(sr)

0    1
1    5
2    7
dtype: int64
As you can see, our data is indexed from 0 to 2, with the data type printed as integer. We can specify our own indexes in the index parameter.
In [16]:
import pandas
sr = pandas.Series([1,5,7], index = ['A','B','C'])
print(sr)

A    1
B    5
C    7
dtype: int64
Like ndarrays, we can also specify the data type in a pandas Series using the dtype parameter during creation.
In [18]:
import pandas
sr = pandas.Series([1,5,7], dtype = complex )
print(sr)

0    1.000000+0.000000j
1    5.000000+0.000000j
2    7.000000+0.000000j
dtype: complex128
We can use an ndarray to create a pandas Series.
In [19]:
import numpy
import pandas
ar = numpy.array([1,5,7])
sr = pandas.Series( data = ar, copy = True ) # same as sr = pandas.Series(ar, copy = True)
print(sr)

0    1
1    5
2    7
dtype: int32
We passed the ar ndarray as the data for the Series (use of the data parameter isn't necessary, it's just for better understanding) and also used the copy parameter to create a copy of the data. If you want to get the data without the indexes, use the values attribute.
In [21]:
import numpy
import pandas
ar = numpy.array([1,5,7])
sr = pandas.Series(ar)
print(sr.values)

[1 5 7]
You can print a more detailed version of the above using the array attribute.
In [22]:
import numpy
import pandas
ar = numpy.array([1,5,7])
sr = pandas.Series(ar)
print(sr.array)

[1, 5, 7]
Length: 3, dtype: int32
You can use the values or array attribute according to your needs, whether you want just the values or a summarized detail of the array in that pandas Series. Also note the difference between the array function in NumPy and the array attribute in Pandas.
Pandas DataFrames
A pandas DataFrame aligns data in a tabular fashion of rows and columns. A DataFrame can be created using the DataFrame() function; we need to pass a dictionary as the data.
In [23]:
import pandas
df = pandas.DataFrame({"Product":['Cookies','Biscuits'], "Sales":[157,227]})
print(df)

    Product  Sales
0   Cookies    157
1  Biscuits    227
Dictionary keys are the columns and their values are the contents of the rows of the Data Frame. We can also use the index parameter here

In [24]:
import pandas
df = pandas.DataFrame({"Product":['Cookies','Biscuits'],
                       "Sales":[157,227]}, index = [1,2])
print(df)

    Product  Sales
1   Cookies    157
2  Biscuits    227
We can define the columns and their data separately using ndarrays

In [42]:
import pandas
import numpy
ar = numpy.array([[1,3],[6,2]])
df = pandas.DataFrame(data = ar, index = ['A','B'],
                      columns = ['C1','C2'])
print(df)

   C1  C2
A   1   3
B   6   2
The data is stored in the ndarray and the columns are defined in the DataFrame's columns parameter. Note that a 2-dimensional ndarray containing two 1-dimensional arrays is passed to the data parameter to act as the data
We can add columns to the DataFrame using the df[column] = values syntax

In [44]:
import pandas
import numpy
ar = numpy.array([[1,3],[6,2]])
df = pandas.DataFrame(data = ar, index = ['A','B'],
                      columns = ['C1','C2'])
df['C3'] = (df['C1']*5)
print(df)

   C1  C2  C3
A   1   3   5
B   6   2  30
We can delete columns from the DataFrame using the del statement

In [45]:
import pandas
import numpy
ar = numpy.array([[1,3],[6,2]])
df = pandas.DataFrame(data = ar, index = ['A','B'],
                      columns = ['C1','C2'])
df['C3'] = (df['C1']*5)
del df['C2']
print(df)

   C1  C3
A   1   5
B   6  30
We can print a column of the DataFrame using the df[column] syntax

In [46]:
import pandas
import numpy
ar = numpy.array([[1,3],[6,2]])
df = pandas.DataFrame(data = ar, index = ['A','B'],
                      columns = ['C1','C2'])
print(df['C1'])

A    1
B    6
Name: C1, dtype: int32
Slicing Syntax

To get a single element from a ndarray, a pandas Series or a pandas DataFrame, we need to use the slice syntax [start:end:step(optional)]. Let's extract some elements from the arrays we have created so far.

In [59]:
import numpy as npy
ar1 = npy.array([1, 5])
ar2 = npy.array([[1, 3],
                 [5, 2]])
ar3 = npy.array([[[1, 3],
                  [5, 2]],
                 [[2, 4],
                  [4, 6]]])
# Slicing 1-dimensional array
print(ar1[0])
# Slicing 2-dimensional array
print(ar2[0,1])
# Slicing 3-dimensional array
print(ar3[1,0,1])

1
3
4
We use a comma (,) to slice further into 2 or more dimensional arrays. The following figure will help you understand the slicing of the 3-dimensional array better

ar3         (full array)    [[[1, 3], [5, 2]], [[2, 4], [4, 6]]]
ar3[1]      (first slice)   [[2, 4], [4, 6]]
ar3[1,0]    (second slice)  [2, 4]
ar3[1,0,1]  (final slice)   4
Slicing may seem a bit tough for beginners due to the dimensions; that's why I created the figure to help you understand slicing better. If you are confident, try solving the slicing questions in the Exercise
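For extra practice, here is a small sketch of ranges, steps and negative indexes on the same kind of arrays as above:

```python
import numpy as npy

ar1 = npy.array([1, 5])
ar2 = npy.array([[1, 3], [5, 2]])

print(ar1[-1])    # negative index: last element -> 5
print(ar2[0, :])  # whole first row -> [1 3]
print(ar2[:, 1])  # whole second column -> [3 2]
print(ar1[::-1])  # step of -1 reverses -> [5 1]
```

The : on its own means "take everything along this dimension", which is how a full row or column is selected.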
To get an element from a pandas Series, we use the [] syntax

In [4]:
import pandas as pan
sr = pan.Series([1, 3, 5], index = ['a','b','c'])
print(sr['a'])  # label-based indexing
print(sr[0])    # position-based indexing

1
1
But what if you have numeric labels as the indexes, like these?

In [6]:
import pandas as pan
sr = pan.Series([1, 3, 5], index = [2,4,6])

If you want the second element, use the .loc[] syntax to select by the labels defined in the index parameter, and the .iloc[] syntax to select by the position (0, 1, 2, ...)

In [7]:
import pandas as pan
sr = pan.Series([1, 3, 5], index = [2,4,6])
print(sr.loc[4])
print(sr.iloc[1])

3
3
We can modify or delete the elements using slicing

In [9]:
import pandas as pan
sr = pan.Series([1, 3, 5], index = [2,4,6])
sr[4] = 7
print(sr)
del sr[6]
print(sr)

2    1
4    7
6    5
dtype: int64
2    1
4    7
dtype: int64
Let's say we have a DataFrame like this

In [27]:
import pandas as pan
sr = pan.DataFrame({'Product':['Biscuit','Cookies'],
                    'Sales':[227,158]}, index = [1,2])
sr

Out[27]:
   Product  Sales
1  Biscuit    227
2  Cookies    158
and want the Sales column only; use the [] syntax

In [32]:
import pandas as pan
sr = pan.DataFrame({'Product':['Biscuit','Cookies'],
                    'Sales':[227,158]}, index = [1,2])
sr['Sales']

Out[32]:
1    227
2    158
Name: Sales, dtype: int64
or to get the second row only, use the .loc[] syntax

In [33]:
import pandas as pan
sr = pan.DataFrame({'Product':['Biscuit','Cookies'],
                    'Sales':[227,158]}, index = [1,2])
sr.loc[2]  # You can also use sr.iloc[1]

Out[33]:
Product    Cookies
Sales          158
Name: 2, dtype: object
or to get the sales of cookies only, use the .values[] syntax

In [37]:
import pandas as pan
sr = pan.DataFrame({'Product':['Biscuit','Cookies'],
                    'Sales':[227,158]}, index = [1,2])
sr.values[1,1]

Out[37]:
158
The values are stored as ndarrays; that's why slicing similar to that of 2-dimensional ndarrays is used
We can delete a whole column from the DataFrame

In [41]:
import pandas as pan
sr = pan.DataFrame({'Product':['Biscuit','Cookies'],
                    'Sales':[227,158]}, index = [1,2])
sr

Out[41]:
   Product  Sales
1  Biscuit    227
2  Cookies    158

In [45]:
del sr['Sales']
sr

Out[45]:
   Product
1  Biscuit
2  Cookies
but we cannot delete a value

In [47]:
import pandas as pan
sr = pan.DataFrame({'Product':['Biscuit','Cookies'],
                    'Sales':[227,158]}, index = [1,2])
del sr.values[1,1]

ValueError                     Traceback (most recent call last)
      2 sr = pan.DataFrame({'Product':['Biscuit','Cookies'],
      3                     'Sales':[227,158]}, index = [1,2])
----> 4 del sr.values[1,1]

ValueError: cannot delete array elements
nor can we modify a value this way

In [48]:
import pandas as pan
sr = pan.DataFrame({'Product':['Biscuit','Cookies'],
                    'Sales':[227,158]}, index = [1,2])
sr

Out[48]:
   Product  Sales
1  Biscuit    227
2  Cookies    158

In [51]:
sr.values[1,1] = 162
sr.values[1,1]

Out[51]:
158
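The change is lost because .values can hand back a fresh copy of the data when the DataFrame mixes dtypes (strings and integers here). A sketch of the reliable way, writing through .loc so the change lands in the DataFrame itself:

```python
import pandas as pan

sr = pan.DataFrame({'Product': ['Biscuit', 'Cookies'],
                    'Sales': [227, 158]}, index=[1, 2])

# .loc addresses the DataFrame directly, so the assignment persists
sr.loc[2, 'Sales'] = 162
print(sr.loc[2, 'Sales'])  # 162
```

The same works with .iloc[1, 1] = 162 if you prefer positional indexing.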
More with ndarrays

We can reverse a ndarray using the [::-1] syntax

In [56]:
import numpy as npy
ar = npy.array([1,2,3,4])
ar

Out[56]:
array([1, 2, 3, 4])

In [55]:
ar = ar[::-1]
ar

Out[55]:
array([4, 3, 2, 1])
We can sort a whole ndarray without doing it the long way

In [63]:
import numpy as npy
ar = npy.array([5,1,3,9])
ar

Out[63]:
array([5, 1, 3, 9])

In [64]:
ar.sort()
ar

Out[64]:
array([1, 3, 5, 9])
There are many built-in ndarray methods that will not be discussed now but will be used in various steps of the future lessons. You may go to the documentation to find all the functions and their roles; since we don't require every function for our data processing and analysis, the miscellaneous functions are not discussed in this book
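As a taste of what the documentation covers, a few of the commonly used ndarray methods (a quick sketch, not an exhaustive list):

```python
import numpy as npy

ar = npy.array([[1, 2], [3, 4]])

print(ar.reshape(4))   # flatten to 1-D: [1 2 3 4]
print(ar.sum())        # sum of all elements: 10
print(ar.mean())       # average of all elements: 2.5
print(ar.max(axis=0))  # column-wise maximum: [3 4]
print(ar.T)            # transposed view of the array
```

The axis argument appears on many of these methods: axis=0 works down the columns, axis=1 across the rows.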
DATA OPERATIONS
Data Cleansing

Let's consider a situation like the one below

In [71]:
import pandas as pan
import numpy as npy
ar = npy.array([[1,2,3],[4,7,2],[4,9,1]])
df = pan.DataFrame( data = ar, index = ['a','c','e'],
                    columns = ['C1','C2','C3'])
df

Out[71]:
   C1  C2  C3
a   1   2   3
c   4   7   2
e   4   9   1
In [72]:
df = df.reindex(['a','b','c','d','e'])
df

Out[72]:
    C1   C2   C3
a  1.0  2.0  3.0
b  NaN  NaN  NaN
c  4.0  7.0  2.0
d  NaN  NaN  NaN
e  4.0  9.0  1.0
The reindexed Data Frame has NaN values in the b and d rows. This happened because there is no data for the b and d rows; using reindexing, we have created a DataFrame with missing values. In the output, NaN means Not a Number. To make detecting missing values easier (and across different array dtypes), Pandas provides the isnull() and notnull() functions, which are also methods on Series and DataFrame objects
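A minimal sketch of those detection helpers, rebuilding the reindexed DataFrame from above:

```python
import numpy as npy
import pandas as pan

ar = npy.array([[1, 2, 3], [4, 7, 2], [4, 9, 1]])
df = pan.DataFrame(data=ar, index=['a', 'c', 'e'],
                   columns=['C1', 'C2', 'C3'])
df = df.reindex(['a', 'b', 'c', 'd', 'e'])

# Boolean mask: True where a value is missing
print(df['C1'].isnull())

# Count of missing values per column: 2 in each (rows b and d)
print(df.isnull().sum())
```

notnull() is simply the inverse mask, True where a value is present.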
Pandas provides various methods for cleaning missing values. The fillna function can fill NaN values with non-null data in a couple of ways, like replacing the NaN values with 0

In [74]:
df.fillna(0)

Out[74]:
    C1   C2   C3
a  1.0  2.0  3.0
b  0.0  0.0  0.0
c  4.0  7.0  2.0
d  0.0  0.0  0.0
e  4.0  9.0  1.0
We can copy the value above or below the missing data by passing 'pad' or 'bfill' to the method parameter of the fillna function

In [75]:
df.fillna( method = 'pad' )

Out[75]:
    C1   C2   C3
a  1.0  2.0  3.0
b  1.0  2.0  3.0
c  4.0  7.0  2.0
d  4.0  7.0  2.0
e  4.0  9.0  1.0
We can drop the rows with missing values with the dropna function

In [76]:
df.dropna()

Out[76]:
    C1   C2   C3
a  1.0  2.0  3.0
c  4.0  7.0  2.0
e  4.0  9.0  1.0
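dropna can also work column-wise or with a minimum-count rule; a short sketch on a toy DataFrame with one missing value:

```python
import numpy as npy
import pandas as pan

df = pan.DataFrame({'C1': [1.0, npy.nan, 4.0],
                    'C2': [2.0, 7.0, 9.0]}, index=['a', 'b', 'c'])

# axis=1 drops COLUMNS containing missing values, leaving only C2
print(df.dropna(axis=1))

# thresh keeps only rows with at least that many non-null values
print(df.dropna(thresh=2))
```

Here thresh=2 keeps rows a and c (2 non-null values each) and drops row b.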
If we want to change a single value in a Data Frame, we can use the replace function

In [78]:
import pandas as pan
import numpy as npy
ar = npy.array([[1,2,3],[4,7,2],[4,9,1]])
df = pan.DataFrame( data = ar, index = ['a','c','e'],
                    columns = ['C1','C2','C3'])
df.replace({3:13})

Out[78]:
   C1  C2  C3
a   1   2  13
c   4   7   2
e   4   9   1
10 DATA ANALYSIS & PROCESSING
• Data Analytics
• Correlations between attributes
• Skewness of the data
As we learned in the mathematics for machine learning lesson, we need a lot of analytics or statistics of our data to know more about it. As we already know, the measures of central tendency, i.e. mean, median and mode, are the basic statistics of our data, which tell us about the average of the data, the 50% or middle value, and the most occurring value in the whole data. Likewise we will analyze our data and, as mentioned earlier, we don't need to calculate these manually or through formulas; there are plenty of functions in different libraries to conduct the analysis

Data analytics
Before training any model we need to check the data and its details. We will use 'trees.csv' as our data for now. You can get the file either by scanning the QR code or from the link (https://defmycode.cf/wp-content/u...). Make sure to move the file to the home directory of Jupyter Notebook and then import the csv data. Before doing any further action, let's have a look at our raw data
In [2]:
import pandas
dt = pandas.read_csv('trees.csv')
dt

Out[2]:
   Index  "Girth (in)"  "Height (ft)"  "Volume(ft^3)"
0      1           8.3             70            10.3
1      2           8.6             65            10.3
2      3           8.8             63            10.2
3      4          10.5             72            16.4
4      5          10.7             81            18.8
The first analysis is to know the shape of the data, i.e. how many rows and columns are present in it. We can do so by using the shape attribute of the dataframe object

In [4]:
import pandas
dt = pandas.read_csv('trees.csv')
dt.shape

Out[4]:
(31, 4)
So our data has 31 rows and 4 columns, i.e. 124 values in total. If we want, we can inspect just the first 10 rows using the head() function and passing 10 as the argument

In [5]:
import pandas
dt = pandas.read_csv('trees.csv')
dt.head(10)

Out[5]:
   Index  "Girth (in)"  "Height (ft)"  "Volume(ft^3)"
0      1           8.3             70            10.3
1      2           8.6             65            10.3
2      3           8.8             63            10.2
3      4          10.5             72            16.4
4      5          10.7             81            18.8
5      6          10.8             83            19.7
6      7          11.0             66            15.6
7      8          11.0             75            18.2
8      9          11.1             80            22.6
9     10          11.2             75            19.9
To get a statistical overview of the whole data we can use the describe() function, which provides 8 properties: count, mean, standard deviation, minimum value, 25% (first interquartile separator), 50% (median), 75% (third interquartile separator) and maximum value

In [6]:
import pandas
dt = pandas.read_csv('trees.csv')
dt.describe()

Out[6]:
           Index  "Girth (in)"  "Height (ft)"  "Volume(ft^3)"
count  31.000000     31.000000      31.000000       31.000000
mean   16.000000     13.248387      76.000000       30.170968
std     9.092121      3.138139       6.371813       16.437846
min     1.000000      8.300000      63.000000       10.200000
25%     8.500000     11.050000      72.000000       19.400000
50%    16.000000     12.900000      76.000000       24.200000
75%    23.500000     15.250000      80.000000       37.300000
max    31.000000     20.600000      87.000000       77.000000
If you want the values rounded off to, say, 2 decimal places, we can use the pandas set_option() function and specify the precision as 2. We can specify a lot of options through this function

In [7]:
import pandas
dt = pandas.read_csv('trees.csv')
pandas.set_option('precision',2)
dt.describe()

Out[7]:
       Index  "Girth (in)"  "Height (ft)"  "Volume(ft^3)"
count  31.00         31.00          31.00           31.00
mean   16.00         13.25          76.00           30.17
std     9.09          3.14           6.37           16.44
min     1.00          8.30          63.00           10.20
25%     8.50         11.05          72.00           19.40
50%    16.00         12.90          76.00           24.20
75%    23.50         15.25          80.00           37.30
max    31.00         20.60          87.00           77.00
Correlation between attributes

The relation between two attributes (feature or label) in a data set is called correlation. It is important to know the relations between the attributes. We can find them using the corr() function with Pearson's Correlation Coefficient, which can be understood as follows:
• 1 represents positive correlation
• 0 represents no relation at all
• -1 represents negative correlation

In [2]:
import pandas
dt = pandas.read_csv('trees.csv')
pandas.set_option('precision',2)
dt.corr(method='pearson')

Out[2]:
                Index  "Girth (in)"  "Height (ft)"  "Volume(ft^3)"
Index            1.00          0.97           0.47            0.90
"Girth (in)"     0.97          1.00           0.52            0.97
"Height (ft)"    0.47          0.52           1.00            0.60
"Volume(ft^3)"   0.90          0.97           0.60            1.00
Note that we set the precision of the values to 2 to keep them rounded off to 2 decimal places. In the corr() function we specified pearson in the method parameter. As we already know that the Girth, Height and Volume of a tree are correlated, we get values around 0.5 - 1.0, which represent positive correlation: if the Height is changed the Volume will be affected, if the Girth is changed the Volume will be affected, and vice versa
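The method parameter also accepts other coefficients; a sketch using Spearman's rank correlation on a few illustrative rows (not the full trees file):

```python
import pandas

dt = pandas.DataFrame({'Girth': [8.3, 8.6, 8.8, 10.5, 10.7],
                       'Volume': [10.3, 10.3, 10.2, 16.4, 18.8]})

# Rank-based correlation: captures monotonic (not just linear) relations
print(dt.corr(method='spearman'))
```

'kendall' is supported as well; all three return the same matrix layout with 1.0 on the diagonal.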
Skewness of the data

Skewness of a data set is the situation when the data appears to have a normal distribution but is skewed to either the left or the right. We need the skewness of the data to correct it during preparation. The closer the value is to 0, the less skewed the data is; the closer the value is to -1 or 1, the more skewed it is to the left or the right side. Let's check the skewness of our trees data using the skew() function

In [3]:
import pandas
dt = pandas.read_csv('trees.csv')
pandas.set_option('precision',2)
dt.skew()

Out[3]:
Index             0.00
"Girth (in)"      0.55
"Height (ft)"    -0.39
"Volume(ft^3)"    1.12
dtype: float64
As the Index column has values from 1 to 31, its skewness is 0, i.e. no skewness at all. On the other hand, Girth can be said to be skewed to the right side, Height is skewed to the left side, and Volume is highly skewed to the right side, i.e. beyond 1. While preparing the data we must consider the skewness and keep it as close as possible to 0
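One common way to pull a right-skewed attribute closer to 0 skewness is a log transform; a sketch on synthetic right-skewed data (illustrative, not the trees file):

```python
import numpy
import pandas

numpy.random.seed(0)
# Log-normal data is strongly right-skewed by construction
raw = pandas.Series(numpy.random.lognormal(size=1000))

print(round(raw.skew(), 2))             # clearly positive
print(round(numpy.log(raw).skew(), 2))  # much closer to 0
```

The same idea applies to a column like Volume: transform, train, then invert the transform on any predictions if needed.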
Data Processing

Before feeding the data to models we need to pre-process it, because the algorithms are completely dependent on the data, so it must be as clean and appropriate as possible. While finding skewness we found that our data is skewed, i.e. it needs to be closer to 0 for better results, so let's look at some processes to ready our data
Scaling

Our data is spread over a wide range with different scales, i.e. it is not suitable to train models. We need to bring our data into a more appropriate scale; we can do so using the MinMaxScaler class and its fit_transform() method from the scikit-learn library. We can scale our data into the range of 0 to 1, which is the most appropriate range for the algorithms

In [28]:
import pandas
from sklearn import preprocessing
dt = pandas.read_csv('trees.csv')
ar = dt.values  # array
# Scaler object
Sclr = preprocessing.MinMaxScaler(feature_range=(0,1))
skl_ar = Sclr.fit_transform(ar)  # Scaling
# Scaled data
skl_dt = pandas.DataFrame(skl_ar,
          columns=['S.No.','Girth','Height','Volume'])
skl_dt.round(1).loc[5:10]

Out[28]:
    S.No.  Girth  Height  Volume
5     0.2    0.2     0.8     0.1
6     0.2    0.2     0.1     0.1
7     0.2    0.2     0.5     0.1
8     0.3    0.2     0.7     0.2
9     0.3    0.2     0.5     0.1
10    0.3    0.2     0.7     0.2
You can compare these values with the unscaled values beside them. If you want, you can change the range to, say, 0-100 through the feature_range parameter during the scaler class initialization

    S.No.  Girth  Height  Volume
5       5   10.7      81    18.8
6       6   10.8      83    19.7
7       7   11.0      66    15.6
8       8   11.0      75    18.2
9       9   11.1      80    22.6
10     10   11.2      75    19.9
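A sketch of that feature_range option on a tiny illustrative array:

```python
import numpy
from sklearn import preprocessing

ar = numpy.array([[1.0], [5.0], [9.0]])

# Rescale into 0-100 instead of the default 0-1
Sclr = preprocessing.MinMaxScaler(feature_range=(0, 100))
print(Sclr.fit_transform(ar))  # [[0.], [50.], [100.]]
```

The minimum of each column maps to the lower bound, the maximum to the upper bound, and everything else is interpolated linearly in between.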
Normalization

Normalization is used to rescale each row of data to have a length of 1. It is mainly useful in sparse datasets where we have lots of zeros. There are two normalization processes, namely L1 and L2. With the L1 method, all the values in each row sum up to 1. We can demonstrate this using the Normalizer class and its transform method. To use the L1 method, specify 'l1' in the norm parameter of the class

In [3]:
import pandas
from sklearn import preprocessing
dt = pandas.read_csv('trees.csv')
ar = dt.values
Nm = preprocessing.Normalizer(norm='l1')
nm_ar = Nm.transform(ar)
print(nm_ar[:5])
# Sum of the rows
for i in [0,1,2,3,4]:
    print(sum(nm_ar[i]))

[[0.01116071 0.09263393 0.78125    0.11495536]
 [0.02328289 0.10011641 0.75669383 0.11990687]
 [0.03529412 0.10352941 0.74117647 0.12      ]
 [0.03887269 0.10204082 0.69970845 0.15937804]
 [0.04329004 0.09264069 0.7012987  0.16277056]]
1.0
1.0
1.0
0.9999999999999999
1.0

We created the Nm object of the Normalizer class with the normalizing method as 'l1' in the norm parameter during initialization, normalized our ar data values with the transform method, and stored the result in the nm_ar variable. Then we printed the first 5 rows of the normalized data values. We also created a for loop to print the sum of each row of the normalized data, which is 1 (except for the 4th row, i.e. 0.99..., because we didn't perform any rounding off)
With the next method, L2 normalization, the squares of the values in each row sum up to 1. So let's use 'l2' in the norm parameter and check the sums

In [12]:
import pandas
from sklearn import preprocessing
dt = pandas.read_csv('trees.csv')
ar = dt.values
Nm = preprocessing.Normalizer(norm='l2')
nm_ar = Nm.transform(ar)
print('L2 Normalization\n')
print(nm_ar[:5])
# Sum of values in the rows
print('\nSum of the values in each row\n')
for i in [0,1,2]:
    print(sum(nm_ar[i]))
# Sum of the squares of the values in the rows
sm_row = '\nSum of squares of the values in each row\n'
for i in [0,1,2,3]:
    print(sm_row)
    sm_row = 0
    for val in nm_ar[i]:
        sm_row += val*val

L2 Normalization

[[0.01403589 0.11649791 0.98251251 0.1445697 ]
 [0.03012017 0.12951675 0.97890567 0.1551189 ]
 [0.04651593 0.13644674 0.97683459 0.15815417]
 [0.05355175 0.14057333 0.96393143 0.21956216]
 [0.05953254 0.12739964 0.96442719 0.22384236]]

Sum of the values in each row

1.2576160130346932
1.2936614951212533
1.3179514295489714

Sum of squares of the values in each row

1.0
1.0
1.0
The code may be a bit hard to understand because of the for loops, so let's walk through it. We normalize the data as before, but this time with the 'l2' method, and print the data values. Then we print the sums of the values of the first three rows of the normalized data, but they don't sum up to 1. Next, as the L2 method states, we print the sums of the squared values in the first three rows using a for loop, and these turn out to be exactly 1. Before the loop, we create an sm_row variable in which we accumulate the squared values of each row, but at the start we store a heading string in it. The outer loop iterates over the row indexes (one extra index is included, because at the first iteration the heading string in sm_row is printed, after which its value is reset to 0). The inner loop then adds the square of each element of the row with the += compound assignment operator. After all the values are summed up, we return to the outer loop, print the sum, and again reset the value to 0 to hold the sum of the next row, until all the sums are printed
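For reference, the same row sums can be computed without any explicit loop using NumPy's axis argument (a sketch equivalent to the loops above, on a tiny illustrative array):

```python
import numpy
from sklearn import preprocessing

ar = numpy.array([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])

nm_ar = preprocessing.Normalizer(norm='l2').transform(ar)

# Sum of the values in each row (not 1 for L2)
print(nm_ar.sum(axis=1))

# Sum of the SQUARES of the values in each row: exactly 1 for L2
print((nm_ar ** 2).sum(axis=1))
```

axis=1 collapses each row to a single number, which replaces both the outer and the inner loop in one expression.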
[Figure: flow of the sm_row variable — the heading string is printed on the first run of the outer loop, then sm_row is reset to 0 and refilled by the inner loop, printed, and reset again for each following row]
Binarization

In binarization we binarize our data, i.e. reduce it to just two crisp values using a threshold. For example, if we set the threshold to 10, all the values in the data set under 10 will be converted to 0 and those above 10 will be converted to 1. Let's binarize our data with the Binarizer class and its transform() method

In [21]:
import pandas
from sklearn import preprocessing
dt = pandas.read_csv('trees.csv')
ar = dt.values
Nm = preprocessing.Normalizer(norm='l1')
nm_ar = Nm.transform(ar)
Bin = preprocessing.Binarizer(threshold=0.1)
bin_ar = Bin.transform(nm_ar)
bin_ar[10:16]

Out[21]:
array([[0., 0., 1., 1.],
       [0., 0., 1., 1.],
       [1., 0., 1., 1.],
       [1., 1., 1., 1.],
       [1., 0., 1., 1.],
       [1., 1., 1., 1.]])

As you can see, we used the normalized (L1) data, which has a range of 0 to 1; this made it easier to set a threshold, specified in the threshold parameter as 0.1. So all the values below 0.1 are changed to 0 and all the values above 0.1 are changed to 1
Standardization

Standardization or standard scaling is the method of changing the distribution of data attributes to a Gaussian (normal) distribution. In this method the mean is changed to 0 and the standard deviation is changed to 1. Let's standardize our data using the StandardScaler class and its fit() and transform() methods

In [14]:
import pandas
import numpy
from sklearn import preprocessing
dt = pandas.read_csv('trees.csv')
ar = dt.values
# Standardizer
Std = preprocessing.StandardScaler().fit(ar)
std_ar = Std.transform(ar)
print(std_ar[0:3])
print('Mean:', round(numpy.mean(std_ar),2))
print('Std.Deviation:', round(numpy.std(std_ar),2))

[[-1.67705098 -1.60291968 -0.9572127  -1.22883711]
 [-1.56524758 -1.50574137 -1.75488995 -1.22883711]
 [-1.45344419 -1.44095583 -2.07396086 -1.23502119]]
Mean: -0.0
Std.Deviation: 1.0

During the StandardScaler object initialization we also called the fit() function to fit the scaler to our ar array before transforming it; if you don't call fit you'll get an error like

This StandardScaler instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator

You can also use the fit and transform functions in the previous preprocessing methods; for demonstration purposes they weren't used in the previous examples, but make sure to use them in your applications. Note that we used the mean() and std() functions of the numpy package to calculate the mean and standard deviation, i.e. 0 and 1
Label encoding

In many cases our data has labels (words) rather than features (numeric values), but using words (strings) in processing limits many activities. For that purpose we need to change those labels into numeric notations or features, like in the following example

In [15]:
import pandas
from sklearn import preprocessing
dt = pandas.DataFrame({'Questions':['A','B','C','D','E'],
       'Answers':['True','True','False','True','False']})
dt

Out[15]:
  Questions Answers
0         A    True
1         B    True
2         C   False
3         D    True
4         E   False
We can use the LabelEncoder class for label encoding

In [17]:
import pandas
from sklearn import preprocessing
dt = pandas.DataFrame({'Questions':['A','B','C','D','E'],
       'Answers':['True','True','False','True','False']})
Enc = preprocessing.LabelEncoder()
Enc.fit(dt['Answers'])
# Encoded labels
dt['Answers'] = Enc.transform(dt['Answers'])
dt

Out[17]:
  Questions  Answers
0         A        1
1         B        1
2         C        0
3         D        1
4         E        0
As you can see, we had the Questions as A-E and the Answers as True or False, and we encoded the Answers label to be 0 (False) or 1 (True). If we want, we can get the label back for a value, i.e. decode the 0 or 1 values, using the inverse_transform() function

In [18]:
import pandas
from sklearn import preprocessing
dt = pandas.DataFrame({'Questions':['A','B','C','D','E'],
       'Answers':['True','True','False','True','False']})
Enc = preprocessing.LabelEncoder()
Enc.fit(dt['Answers'])
# Encoded labels
dt['Answers'] = Enc.transform(dt['Answers'])
# Decoding labels
print(Enc.inverse_transform([0,1]))

['False' 'True']

By encoding we can hide the true values and perform a lot of operations with them, because they are numerical values. In this data we had only two label values, i.e. True and False, but when there are more values the encoding will range from 0 up to their respective count
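A sketch with more than two label values; the classes_ attribute shows how the codes were assigned (labels are sorted alphabetically, then numbered):

```python
from sklearn import preprocessing

Enc = preprocessing.LabelEncoder()
Enc.fit(['Low', 'Moderate', 'High', 'No Rain'])

# Sorted classes get codes 0..n-1
print(Enc.classes_)                       # ['High' 'Low' 'Moderate' 'No Rain']
print(Enc.transform(['Low', 'No Rain']))  # [1 3]
```

Note the ordering is alphabetical, not by meaning, so 'High' gets code 0 even though it is the largest category.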
[Figure: the fitted decision tree. The root splits on Gender < 0.5 (gini = 0.667, samples = 12, value = [2, 6, 0, 0, 2, 0, 2], class = Strawberry); its children split on Age < 23.0 (gini = 0.375, samples = 4, class = Vanilla), Age < 15.5 (gini = 0.444, samples = 6, class = Strawberry) and Age < 13.5 (gini = 0.667, samples = 6, class = Chocolate), continuing through Age < 18.0 (gini = 0.5, samples = 4, class = Chocolate) down to pure leaves with gini = 0.0, e.g. samples = 2, class = Strawberry and samples = 2, class = Chocolate]
Likewise, we can use the decision tree to solve different kinds of classification problems. But you may also ask: how does the tree create those comparisons or splits? It isn't strictly necessary to know, but you should. First, the algorithm calculates the gini index for each attribute using the formula 1 − (p² + q²), where p² + q² is the sum of the squared probabilities of success (p) and failure (q). Then the dataset is split into two lists of rows using the index of an attribute and a split value of that attribute. Finally, it finds the best possible split by evaluating the cost (gini) of each candidate split
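The gini calculation can be sketched in a few lines; feeding it the class counts [1, 3] reproduces the gini = 0.375 shown in one of the tree's nodes (this is a sketch, not scikit-learn's internal code):

```python
def gini(counts):
    """Gini impurity: 1 minus the sum of squared class probabilities."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(gini([1, 3]))  # 1 - (0.25^2 + 0.75^2) = 0.375
print(gini([2, 2]))  # maximally mixed two-class node: 0.5
print(gini([4]))     # pure node: 0.0
```

A gini of 0 means the node is pure, so splitting stops there, exactly like the gini = 0.0 leaves in the figure.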
CLASSIFICATION
Logistic Regression

Logistic regression is a type of model that predicts the outcome of output values as yes or no, as the numeric values 1 or 0 respectively. We can use these models to classify a day as rainy or not, a person as healthy or sick, etc. There are different types of logistic regression used for different situations.
Binomial Logistic Regression
Binomial or binary logistic regression is used to predict exactly two outcomes, i.e. either 1 (positive) or 0 (negative). Let's use a dataset to predict whether it will rain or not, given the temperature and humidity percent as input. You can download the dataset from the link (https://defmycode.cf/wp-conte...) or by scanning the QR code (check the Resources). Let's import the modules together; this time we will import linear_model and train_test_split from sklearn

In [1]:
import pandas
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
We also imported the accuracy_score() function from sklearn.metrics to calculate the accuracy of our model. Now we can import our dataset, and this time let's view it as it is
In [2]:
import pandas
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
dt = pandas.read_csv('Rainfall_data.csv')
dt

Out[2]:
       Unnamed: 0  Temperature  Humidity%  Rain
0               0           34       74.2   Yes
1               1           19       68.2    No
2               2           28       67.2   Yes
3               3           29       66.6   Yes
4               4           26       57.9   Yes
...           ...          ...        ...   ...
19995       19995           30       77.9   Yes
19996       19996           20       74.8   Yes
19997       19997           14       69.4    No
19998       19998           20       60.6    No
19999       19999           22       64.8    No

20000 rows x 4 columns
As you can see, we have 20000 rows and 4 columns worth of data! Now we can move on to a new cell and split the data into train-test input and train-test output

In [3]:
Input = dt.drop(columns=['Unnamed: 0','Rain'])
Output = dt['Rain']
inp_X, tst_X, out_y, tst_y = train_test_split(
    Input.values,Output,test_size=0.01)

We stored the input features, i.e. Temperature and Humidity%, in Input and the output, i.e. Rain (Yes or No), in Output. Then we passed these values to the train_test_split() function and split the data into training input, testing input, training output and testing output, where the test size is 0.01 (1%, i.e. 200 rows)
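One note: train_test_split shuffles the rows randomly, so each run gives a different split. If you want reproducible splits, pass the random_state parameter (a sketch on toy data):

```python
from sklearn.model_selection import train_test_split

X = [[i] for i in range(10)]
y = list(range(10))

# Same seed -> identical split on every run
a_train, a_test, _, _ = train_test_split(X, y, test_size=0.2, random_state=7)
b_train, b_test, _, _ = train_test_split(X, y, test_size=0.2, random_state=7)
print(a_test == b_test)  # True
```

This is handy when you want to compare models on exactly the same held-out rows.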
Now we can create our logistic regression model CModel and train it

In [4]:
Input = dt.drop(columns=['Unnamed: 0','Rain'])
Output = dt['Rain']
inp_X, tst_X, out_y, tst_y = train_test_split(
    Input.values,Output,test_size=0.01)
CModel = linear_model.LogisticRegression()
CModel.fit(inp_X,out_y)

Out[4]:
LogisticRegression()
So our model is ready to make predictions, letJs move onto a new cell and let the model predict. Then we will compare the values and print the accuracy score In [5]:
from sklearn import preprocessing pred_y = CModel.predict(tst_X) Enc = preprocessing.LabelEncoder().fit(['Yes'No']) cmp = pandas.DataFrame({'Predicted':Enc.transform(pred_y), 'Actual':Enc.transform(tst_y)}) print('Accuracy Score:',accuracy_score(tst_y,pred_y)) cmp.plot(kind='density')
Accuracy Score: 0.91 Out[5]:
So the model has accuracy score of 0.91 i.e. 91%, which is really good! You can also see the density plot where only 9% of values are predicted wrong by the model
So how did our model predict the values, or how does logistic regression work? To understand, we will look at the mathematics behind the algorithm; if you want, you can move ahead or give it a read. The following are the steps of the linear function of binomial logistic regression:
• We already know that the output will be either 0 (No) or 1 (Yes). The linear function is used as an input to another function, g, in the following relation:

hθ(x) = g(θᵀx), where 0 ≤ hθ ≤ 1

• g is the logistic or sigmoid function, which is given by the following formula:

g(z) = 1 / (1 + e⁻ᶻ), where z is θᵀx

• We can visualize the sigmoid curve as an S-shaped graph: the classes can be divided into positive or negative. The output is interpreted as the probability of the positive class, since it lies between 0 and 1. For our implementation, we interpret the output of the hypothesis function as positive if it is greater than or equal to 0.5 (≥ 0.5), otherwise negative
• We also need to define a loss function to measure how well the algorithm performs using the weights, represented by θ, where h is equal to g(Xθ)
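The sigmoid itself is tiny in code; a sketch showing the 0.5 decision threshold at z = 0:

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^-z)."""
    return 1 / (1 + math.exp(-z))

print(sigmoid(0))   # 0.5 -> sits exactly on the decision boundary
print(sigmoid(4))   # close to 1 -> positive class
print(sigmoid(-4))  # close to 0 -> negative class
```

Large positive z pushes the output towards 1 and large negative z towards 0, which is how the linear score θᵀx becomes a probability.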
After defining the loss function, our prime goal is to minimize it
• This can be done by fitting the weights, which means increasing or decreasing them. With the help of the derivatives of the loss function with respect to each weight, we know which parameters should have high weights and which should have smaller ones. The following gradient descent equation tells us how the loss would change if we modified the parameters:

δθⱼ = (1/m) · Xᵀ(g(Xθ) − y)
Multinomial Logistic Regression
As the name suggests, this time we will have to predict more than 2 outputs. In multinomial logistic regression we perform classification into 3 or more categories; the categories can be just different types like Rain, Hailstorm, Snow, etc., or ordinal like heavy rain, moderate rain or low rainfall. Let's consider the previous situation where we predicted whether it will rain or not, and create a model to predict whether the rain will be heavy, moderate or low. You can download the dataset from the link (https://defmycode.cf/wp-content/uplo...) or by scanning the QR code (check the Resources), and import the modules as we did while creating the model to predict the rainfall
In [1]:
import pandas
from sklearn import linear_model, metrics
from sklearn.model_selection import train_test_split
Now we can import our data and preview it, without the head() function

In [2]:
import pandas
from sklearn import linear_model, metrics
from sklearn.model_selection import train_test_split
dt = pandas.read_csv('RainfallData.csv')
dt

Out[2]:
       Temperature  Humidity%  Rainfall
0               34       74.2       Low
1               19       68.2   No Rain
2               28       67.2  Moderate
3               29       66.6  Moderate
4               26       57.9       Low
...            ...        ...       ...
17996           31       89.7   No Rain
17997           21       84.7   No Rain
17998           28       74.7   No Rain
17999           30       78.2   No Rain
18000           34       80.4       Low

18001 rows x 3 columns
We have the same temperature, Humidity percent columns but the rain is classified into No rain, low, moderate and high. Now we can move onto the next i.e. splitting the data In [3]:
Input = dt.drop(columns='Rainfall') Output = dt.drop(columns=['Temperature','Humidity%']) inp-Xjtst-XjOut-Yjtst-y = train_test_split( Input,Output,test_size=0.1)
Next we need to scale our Input data (optional, but we may encounter an error otherwise). We will import the preprocessing module and scale our input data. Then we can split our data into training and testing sets, create our model and train it
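What scale() actually does can be checked on a couple of made-up columns; this sketch is standalone and not part of the rainfall notebook:

```python
import numpy
from sklearn import preprocessing

# Made-up two-column data with very different ranges
raw = numpy.array([[30., 70.],
                   [20., 60.],
                   [25., 80.],
                   [35., 90.]])
scaled = preprocessing.scale(raw)

# Each column is standardized to mean ~0 and standard deviation ~1
print(scaled.mean(axis=0))
print(scaled.std(axis=0))
```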
In [4]:
from sklearn import preprocessing

Input = preprocessing.scale(dt.drop(columns='Rainfall').values)
Output = dt['Rainfall']
inp_X,tst_X,out_y,tst_y = train_test_split(
    Input,Output,test_size=0.2)
CModel = linear_model.LogisticRegression()
CModel.fit(inp_X,out_y)
Out[4]:
LogisticRegression()
Our model is trained. Now we can compare our model's predictions with the actual values. To visualize it we need to use the LabelEncoder and encode the Rainfall labels into numeric values. We will also print the accuracy of our model

In [5]:
pred_y = CModel.predict(tst_X)
Enc = preprocessing.LabelEncoder().fit(['No Rain',
    'Low','Moderate','High'])
cmp = pandas.DataFrame({'Predicted':Enc.transform(
    pred_y),'Actual':Enc.transform(tst_y)})
acc = metrics.accuracy_score(tst_y,pred_y)
print('Accuracy:',acc)
cmp.plot(kind='density')

Accuracy: 0.435156900860872
Out[5]:
(density plot comparing the predicted and actual values)

    ax.scatter(model.support_vectors_[:, 0], model.support_vectors_[:, 1],
               s=300, linewidth=1, facecolors='none')
    ax.set_xlim(xlim)
    ax.set_ylim(ylim)
First of all we get the model, ax (axes) and plot_support (whether to plot the support vectors or not) as parameters. Then we start off with the 2-D graph plot, and if no axes are passed we find them using the gca() function. We also find the x-axis and y-axis limits using the get_xlim() and get_ylim() functions respectively and store them in xlim and ylim
SUPPORT VECTOR MACHINES

Then we create the grid, or the base of our plot, using the xlim and ylim. As we did before, we create values for the lines using the linspace() function, create the grid using the meshgrid() function, and then use the vstack() function to vertically stack the arrays, where the values are reshaped using the ravel() function. We also call the decision_function() to get the values for the boundaries and margins. Next we use the data to plot the boundaries and margins, using the contour() function to draw the lines, and specify the linestyles and the other properties using the respective parameters. At last, we check whether to plot the support vectors or not, and plot them if so using the scatter() function with the 0 and 1 indexed values of the support_vectors_ array. Finally we can plot our data clusters and call the MMH() function and pass our SVC model

In [41]:
pyplot.scatter(X1, X2, c=y)
MMH(CModel)
Finally we have the maximum marginal hyperplane plotted for our data clusters with the support vectors
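Since the earlier pages of the MMH() helper are cut off here, a reconstruction from the description above may help; the 30-point grid size is an assumption, and the function is only a sketch of the idea:

```python
import matplotlib
matplotlib.use('Agg')  # headless backend so the sketch runs without a display
import numpy
from matplotlib import pyplot

def MMH(model, ax=None, plot_support=True):
    """Plot the decision boundary and margins of a fitted 2-D SVC."""
    if ax is None:
        ax = pyplot.gca()               # current axes if none were passed
    xlim = ax.get_xlim()
    ylim = ax.get_ylim()
    # Build a grid over the current plot area
    x = numpy.linspace(xlim[0], xlim[1], 30)
    y = numpy.linspace(ylim[0], ylim[1], 30)
    Y, X = numpy.meshgrid(y, x)
    xy = numpy.vstack([X.ravel(), Y.ravel()]).T
    P = model.decision_function(xy).reshape(X.shape)
    # Decision boundary (level 0) and margins (levels -1 and 1)
    ax.contour(X, Y, P, colors='k', levels=[-1, 0, 1],
               alpha=0.5, linestyles=['--', '-', '--'])
    if plot_support:
        ax.scatter(model.support_vectors_[:, 0],
                   model.support_vectors_[:, 1],
                   s=300, linewidth=1, facecolors='none')
    ax.set_xlim(xlim)
    ax.set_ylim(ylim)
```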
Support Vector Machine Kernels

Support vector machines are implemented with kernels that transform an input data space into a higher-dimensional one, for more flexibility and a smoother workflow for the support vector machines. In the previous model we used the linear kernel; there are different types of kernels, like:
• Linear kernel, used when predicting two outcomes
• Polynomial kernel, a more generalized version of the linear kernel, where the input space is non-linear
• Radial Basis Function kernel, used for SVMs that map the input space into infinite dimensions
This time we will use a sample dataset provided by sklearn to understand the different kernels. First of all we need to import our data and prepare it. Import the following
In [1]:
import numpy
from sklearn import svm, datasets
from matplotlib import pyplot
We load the iris (sample iris flower dataset) dataset from the dataset In [2]:
dt = datasets.load_iris() X = dt.data[:, :2] y = dt.target
We imported the dataset and split it into input data[:, :2] (the first two columns of all the rows) and output target, and stored them in X and y respectively
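The slicing can be verified with a quick standalone check of the shapes:

```python
from sklearn import datasets

dt = datasets.load_iris()
X = dt.data[:, :2]   # keep only the first two feature columns
y = dt.target

print(X.shape)  # (150, 2): 150 samples, 2 features
print(y.shape)  # (150,)
```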
Now we need the data to plot the SVM boundaries, or input data spaces (the different classes). To do so we need to create a grid as we did before. To plot the grid we need the minimum and maximum values of the input and output datasets. Then we reshape them using ravel() and pass them to the c_() function, which in particular stacks arrays along their last axis after upgrading them to at least 2-D, and store the result in X_plot as testing input

In [3]:
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
h = (x_max / x_min)/100
xx, yy = numpy.meshgrid(numpy.arange(x_min, x_max, h),
                        numpy.arange(y_min, y_max, h))
X_plot = numpy.c_[xx.ravel(), yy.ravel()]
We have the required data to train the SVC classifier, so let's create it In [4]:
SvcModel = svm.SVC(kernel='linear',C=1.0).fit(X, y)
We can create the SvcModel using the SVC() function and pass 'linear' to the kernel parameter and 1.0 (float) to the regularization parameter C. We also train it using the fit() method. Now we can predict the X_plot values and store them in Z. We will reshape Z using the reshape() function to the shape of the xx meshgrid. First of all we plot the figure (base). Then we add a subplot and draw the filled contours using the subplot(121) and contourf() functions. We pass the values created with the meshgrid() function and the Z predicted values. Now we can plot the data clusters using the scatter() plot, and finally limit the x-axis to the minimum and maximum values of the xx meshgrid. We can see how our dataset is divided into different spaces by the support vector classifier with the linear kernel
In [5]:
Z = SvcModel.predict(X_plot)
Z = Z.reshape(xx.shape)
pyplot.figure(figsize=(15, 5))
pyplot.subplot(121)
pyplot.contourf(xx, yy, Z, alpha=0.3)
pyplot.scatter(X[:, 0], X[:, 1], c=y)
pyplot.xlim(xx.min(), xx.max())
Out[5]:
(3.3, 8.882727272727251)
Similarly we can create a SVC using the Radial Basis Function kernel. In [10]:
RbfSvc = svm.SVC(kernel='rbf',C=1.0).fit(X, y) Z = RbfSvc.predict(X_plot) Z = Z.reshape(xx.shape) pyplot.figure(figsize=(15, 5)) pyplot.subplot(121) pyplot.contourf(xx, yy, Z,alpha=0.3) pyplot.scatter(X[:, 0], X[:, l],c=y) pyplot.xlim(xx.min(), xx.max())
(Output on the next page) You can observe both of the plots, using the linear and rbf kernels, and notice a clear difference in the lines and curves
Out[10]:
(3.3, 8.882727272727251)
18 CLUSTERING ALGORITHMS
• Clustering
• K-Means algorithm
• Mean shift algorithm
• Hierarchical clustering
Clustering is a case of unsupervised machine learning. Clustering algorithms learn relations in the data and classify it into groups, whether or not the number of groups is provided with the input
Clustering
The following are the different types of clustering:
• Density-based: clusters are formed as dense regions. These algorithms have good accuracy and the capability to merge two clusters together. E.g. Density-Based Spatial Clustering of Applications with Noise (DBSCAN), Ordering Points To Identify the Clustering Structure (OPTICS)
• Hierarchical-based: clusters are formed in a hierarchical tree, with Agglomerative (bottom-up approach) and Divisive (top-down approach) variants. E.g. Clustering Using Representatives (CURE), Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH)
• Partitioning: clusters are formed by partitioning the objects into k partitions, and the number of clusters equals the number of partitions. E.g. K-Means
• Grid: clusters are formed as a grid. This method is fast and independent of the number of objects. E.g. Statistical Information Grid (STING), Clustering In QUEst (CLIQUE)
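As a small taste of the partitioning family, K-Means can be run on synthetic blobs; this sketch just shows the shape of the API, not data from this book:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 dense regions
X, _ = make_blobs(n_samples=150, centers=3, random_state=1)

# Partition the points into k=3 clusters
km = KMeans(n_clusters=3, n_init=10, random_state=1).fit(X)
print(km.cluster_centers_.shape)  # (3, 2): one centroid per cluster
```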
Until now we have calculated the accuracy of supervised learning algorithms using predicted and actual values, but how can we do so for unsupervised learning algorithms, when we are dealing with unlabeled data? There are some metrics that can be used to evaluate the performance or quality of different unsupervised learning algorithms by the changes in the clusters.
Silhouette analysis is used to check the quality of a clustering model by measuring the distance between the clusters. It basically provides us a way to assess parameters like the number of clusters with the help of the silhouette score.
The silhouette score ranges from -1 to 1. The different values represent the following:
• 1 is the situation when the cluster is far away from its neighbouring cluster
• 0 is the situation when the cluster is very close to, or on, the decision boundary separating the clusters
• -1 is the situation when the clusters aren't formed correctly
The silhouette score can be calculated using the following formula:
Silhouette Score = (p - q) / max(p, q)
where p is the mean distance to the points in the nearest cluster and q is the mean intra-cluster distance to all the points. Next we can use the Davies-Bouldin index to know whether the clusters are well spaced from each other or not, and how dense the clusters are. We can calculate the DB index using the following formula:

DB = (1/n) · Σ(i=1..n) max(j≠i) (σi + σj) / d(ci, cj)
where n is the number of clusters, σi is the average distance of all points in cluster i from the cluster centroid ci, and d(ci, cj) is the distance between the centroids of clusters i and j. Lower values indicate good performance, where 0 is the minimum value
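Both scores are available in sklearn; a standalone sketch on synthetic blobs (not this book's data):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

sil = silhouette_score(X, labels)      # in [-1, 1], higher is better
db = davies_bouldin_score(X, labels)   # >= 0, lower is better
print('Silhouette:', sil)
print('Davies-Bouldin:', db)
```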
Dunn index is another metric that can be used to evaluate the performance of a clustering algorithm. It is similar to the DB index, but the differences are:
• It considers only the clusters that are close together, whereas the DB index considers all of the clusters
• A lower Dunn index indicates bad performance, whereas the lower the DB index, the higher the performance of the algorithm
The Dunn index can be calculated using the following formula:

Dunn Index = min(inter-cluster distance) / max(intra-cluster distance)

In [5]:
KNN.fit(inp_X,out_y)
Out[5]:
KNeighborsClassifier()
Now we can pass the test values to the KNN classifier and print the accuracy score In [6]:
pred_y = KNN.predict(tst_X)
acc = accuracy_score(tst_y,pred_y)
print('Accuracy:',acc)
Accuracy: 0.9335
Our model has 93% accuracy, which is more than the logistic regression model by 2%. So this was how we can create KNN algorithms to solve different problems of classification and regression alike
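If the rainfall dataset isn't at hand, the same KNN workflow can be tried on sklearn's bundled iris sample; the split fraction here is an arbitrary choice:

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
inp_X, tst_X, out_y, tst_y = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Train and score a default (5-neighbour) KNN classifier
KNN = KNeighborsClassifier().fit(inp_X, out_y)
acc = accuracy_score(tst_y, KNN.predict(tst_X))
print('Accuracy:', acc)
```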
PERFORMANCE & METRICS
• Calculating the model
• Improving the model
• Saving and loading models
So far we have created a lot of models with different algorithms for different tasks like regression, classification, etc. and also evaluated their performance visually through graphs or their accuracy score. In this lesson we will look at the methods to calculate the performance of algorithms
Calculating the model
In the maths for machine learning lesson we learned about some methods to calculate the error rate, precision, recall and F-measure (page no. 71) using the confusion matrix. All these values can be used to evaluate the performance of a classifier. Let's use the KNN classifier model we created previously and calculate its performance. First of all, let's print the confusion matrix using the confusion_matrix() function

In [8]:
from sklearn import metrics

cm = metrics.confusion_matrix(tst_y,pred_y)
cm
Out[8]:
array([[ 657,   62],
       [  80, 1201]], dtype=int64)
We passed the actual values followed by the predicted values. We can visualize it better like so

In [11]:
from sklearn import metrics

cm = metrics.confusion_matrix(tst_y,pred_y)
cf = pandas.DataFrame({'True +ve':cm[:,0],
                       'True -ve':cm[:,1]},
                      index=['Predicted +ve',
                             'Predicted -ve'])
cf
Out[11]:
               True +ve  True -ve
Predicted +ve  657       62
Predicted -ve  80        1201
So we have 657 True Positives (predicted positive values {1, 'Yes', etc.} that are positive too), 62 False Positives (predicted positive values that are negative), 80 False Negatives (predicted negative values {0, 'No', etc.} that are positive) and 1201 True Negatives (predicted negative values that are negative too). Using the confusion matrix we can calculate other metrics, like:

In [14]:
from sklearn import metrics

cm = metrics.confusion_matrix(tst_y,pred_y)
cf = pandas.DataFrame({'True +ve':cm[:,0],
                       'True -ve':cm[:,1]},
                      index=['Predicted +ve',
                             'Predicted -ve'])
val = (tst_y,pred_y)

acc = metrics.accuracy_score(*val)
pre = metrics.precision_score(*val)
rcl = metrics.recall_score(*val)
fms = metrics.f1_score(*val)

print('Accuracy:',acc)
print('Precision:',pre)
print('Recall:',rcl)
print('F-Measure:',fms)
Accuracy: 0.929
Precision: 0.950910530482977
Recall: 0.9375487900078064
F-Measure: 0.944182389937107
We have calculated the accuracy, precision (true positive values predicted by the model out of the total positive values predicted), recall (true positive values predicted by the model out of the actual positive values) and F-measure (also known as the F1 score) using the accuracy_score(), precision_score(), recall_score() and f1_score() functions, and printed them respectively. We can print all of them together in tabular form using the classification_report() function
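The link between the confusion matrix and these scores can be checked by hand; the labels below are made up for illustration:

```python
from sklearn import metrics

# Toy actual/predicted labels (made up for illustration)
actual = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
pred   = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]

tn, fp, fn, tp = metrics.confusion_matrix(actual, pred).ravel()

# Recompute the scores from their definitions
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
f1        = 2 * precision * recall / (precision + recall)
print(precision, recall, f1)
```

These hand-computed values agree with precision_score(), recall_score() and f1_score() on the same inputs.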
In [17]:
from sklearn import metrics

rep = metrics.classification_report(tst_y,pred_y)
print(rep)

              precision    recall  f1-score   support

           0       0.89      0.91      0.90       719
           1       0.95      0.94      0.94      1281

    accuracy                           0.93      2000
   macro avg       0.92      0.93      0.92      2000
weighted avg       0.93      0.93      0.93      2000
The support is the number of values in the sample for each label, here 0 and 1, i.e. Rain or No Rain. The macro average is the unweighted mean of the per-class scores (each class counts equally, with weight 0.5 here), while the weighted average weights each class's score by its support (so imbalanced classes count unequally). We can use these metrics to evaluate the performance of an algorithm. Next, let's look at the metrics to evaluate a regression model. We will use the KNN regressor we created to predict

In [6]:
from sklearn import metrics

val = (tst_y,pred_y)
err = metrics.max_error(*val)
mae = metrics.mean_absolute_error(*val)
mse = metrics.mean_squared_error(*val)
rsq = metrics.r2_score(*val)

print('Max Error:',err)
print('MAE:',mae)
print('MSE:',mse)
print('R2:',rsq)
Max Error: 19.824399999999983
MAE: 7.167660000000001
MSE: 84.34250813599998
R2: 0.02620416439244655
We calculated the Maximum Error (maximum residual error), MAE (mean absolute error, i.e. the average vertical distance between each point and the regression line), MSE (the mean of the squared distances from each point to the regression line) and R2 (explained variation / total variation) using the max_error(), mean_absolute_error(), mean_squared_error() and r2_score() functions, and printed them respectively. The smaller the Max Error, MAE and MSE are, the better the performance of the model, whereas R2 is a proportion, i.e. the closer to 1.0 the better. A constant model like ours, which always predicts the expected value of y, disregarding the input features, would get an R2 score close to 0.0
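These definitions are easy to verify on a handful of made-up numbers:

```python
import numpy
from sklearn import metrics

actual = numpy.array([3.0, 5.0, 7.5, 10.0])
pred = numpy.array([2.5, 5.5, 7.0, 12.0])

res = numpy.abs(actual - pred)       # absolute residuals
mae = res.mean()                     # mean absolute error
mse = ((actual - pred) ** 2).mean()  # mean squared error
err = res.max()                      # maximum residual error
print(mae, mse, err)
```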
Improving the model
Upon calculating the metrics of an algorithm, we can take the following steps to improve the performance of our models:
• Make sure to train the model with adequate data. The dataset shouldn't have an abnormal distribution of features or labels, like 5 samples of Yes and 95 samples of No
• After loading the data we should always apply the most suitable preprocessing methods to improve its quality, like encoding labels
• We shouldn't save too much data for testing, but not too little either. For datasets with over 10k samples, 20% or less is adequate
• In cases of very little data you can create test values yourself instead of splitting the already scarce data
• You can always test different algorithms on a problem, compare them with their metrics, choose the best and improve it
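For the last point, comparing algorithms, cross_val_score is handy; a standalone sketch on the iris sample (the fold count of 5 is an arbitrary choice):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 5-fold cross-validated accuracy for two candidate algorithms
knn_scores = cross_val_score(KNeighborsClassifier(), X, y, cv=5)
log_scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)

print('KNN mean accuracy:', knn_scores.mean())
print('LogReg mean accuracy:', log_scores.mean())
```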
Saving and loading models
So we have created our model, tested it and even improved it. Let's say we want to use the model somewhere else or share it; how do we do that? Well, we can do so using the joblib module. So let's save our KNN weight-predicting model using joblib

In [6]:
import joblib

joblib.dump(KNN, "WeightPred.sav")
Out[6]:
['WeightPred.sav']
We used the dump() function and passed the KNN regressor and the "WeightPred.sav" filename as arguments. Make sure to use the .sav extension after the model name. As we haven't specified any specific location, the file is stored where the jupyter notebook is hosted:

sales_data.csv  somefile.png  trees.csv  WeightPred.sav
Now we can open a new jupyter notebook, import joblib and load our model

In [1]:
import joblib

KNN = joblib.load("WeightPred.sav")
KNN.predict([[70]])
Out[1]:
array([[133.4852]])
We imported our model using the load() function and passed the saved model's name. We also asked the model to predict the weight of a person with a height of 70 inches, and it predicted about 133.5 pounds
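The whole save/reload round trip can be exercised end to end; the tiny height/weight numbers here are made up, and the file goes to a temporary folder:

```python
import os
import tempfile

import joblib
from sklearn.neighbors import KNeighborsRegressor

# Tiny made-up height (inches) -> weight (pounds) data
X = [[60], [65], [70], [75]]
y = [110, 140, 170, 200]
model = KNeighborsRegressor(n_neighbors=2).fit(X, y)

# Save the model and load it back from a temporary file
path = os.path.join(tempfile.mkdtemp(), 'WeightPred.sav')
joblib.dump(model, path)
loaded = joblib.load(path)

# The reloaded model predicts exactly like the original
print(loaded.predict([[68]]))  # mean of the 2 nearest weights: [155.]
```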
ML APPLICATION 1 • Movie Recommender
Problem: You have to create a model that will suggest the genre of movies a person may like, given the person's age, gender and previously watched movie genre as input. Here is the dataset for the sample recommendation: https://defmycode.cf/wp-content/uploads/2020/12/movies.csv
So the first step is to decide which method to use. The task is to recommend the genre of a movie, that is, to classify a person, so we will use classification. Next, we need to decide which algorithm to use. We aren't dealing with a huge dataset, so we can go with the decision tree classifier. So let's start off by importing all the modules we need

In [1]:
import pandas
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
Now we can import the dataset and preview it using the head() function

In [2]:
import pandas
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder

dt = pandas.read_csv('movies.csv')
dt.head(3)
Out[2]:
   Age  Gender  Watched  Genre
0  19   Male    Comedy   Mystery
1  19   Female  Romance  Drama
2  19   Male    Romance  Drama
As you can see, we need to encode all of the labels into numeric values. We can use one encoder for the Gender labels and another encoder for the Watched and Genre labels

In [3]:
# Gender Encoder
gndr_enc = LabelEncoder()
gndr_enc.fit(['Male','Female'])
Out[3]:
LabelEncoder()
We created the gndr_enc Gender encoder and passed the Gender labels to the fit() method. Now we can create the Genre encoder. But before that we need all the unique Genre labels in both Watched and Genre column In [4]:
# Unique Genre label extraction
watched = dt['Watched'].unique()
genre = dt['Genre'].unique()
Genres = [*watched]
for ele in genre:
    if ele in Genres:
        continue
    else:
        Genres.append(ele)
First of all we extracted the unique labels from the Watched and Genre columns using the unique() method. Then we created another list variable and unpacked the Watched uniques into it (note that we need a single, i.e. 1-D, list; that's why the watched array is unpacked by the * operator). Using the for loop we added the unique labels of the Genre column that aren't already present in the Genres list
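The same duplicate-free union, and the encode/decode round trip used later, can be sketched with made-up labels (the loop above and pandas' concat/unique give the same result):

```python
import pandas
from sklearn.preprocessing import LabelEncoder

watched = pandas.Series(['Comedy', 'Romance', 'Horror'])
genre = pandas.Series(['Mystery', 'Drama', 'Romance'])

# Union of uniques in first-seen order, like the loop above
Genres = pandas.concat([watched, genre]).unique().tolist()
print(Genres)  # ['Comedy', 'Romance', 'Horror', 'Mystery', 'Drama']

# LabelEncoder round trip: labels -> numbers -> labels
enc = LabelEncoder().fit(Genres)
codes = enc.transform(['Comedy', 'Drama'])
print(enc.inverse_transform(codes))
```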
Now we can create our gnre_enc Genre encoder and fit the Genres In [5]:
# Unique Genre label extraction
watched = dt['Watched'].unique()
genre = dt['Genre'].unique()
Genres = [*watched]
for ele in genre:
    if ele in Genres:
        continue
    else:
        Genres.append(ele)

# Genre Encoder
gnre_enc = LabelEncoder()
gnre_enc.fit(Genres)
Out[5]:
LabelEncoder()
All the encoders are ready so let's encode the labels in our dataset with them In [6]:
for col in ['Gender','Watched','Genre']:
    if col == 'Gender':
        # Gender Encode
        dt[col] = gndr_enc.transform(dt[col])
    else:
        # Watched & Genre Encode
        dt[col] = gnre_enc.transform(dt[col])
Now we can divide our dataset into input and output. Let's move onto a new cell, because re-running the above cell would cause an error: the labels are already encoded, so if that cell is executed again the encoder will receive numbers instead of labels

In [7]:
X = dt.drop(columns='Genre')
y = dt['Genre']
CModel = DecisionTreeClassifier()
CModel.fit(X,y)
Out[7]:
DecisionTreeClassifier()
We have also trained our CModel and can use it to make predictions. So let's create a function to pass the values to and return the Genre label

In [8]:
def recommend(age=18,gnd=0,watched=0,test=False):
    # Getting input if testing
    if test:
        age = int(input("Age:"))
        gnd = int(input("Gender:"))
        for g in Genres:
            print(g,'-',*gnre_enc.transform([g]))
        watched = int(input("Watched:"))
    # Ask the model for a recommendation
    pred = CModel.predict([[age,gnd,watched]])
    # Decoding the prediction to a label
    rec = gnre_enc.inverse_transform(pred)
    return rec[0]
So we defined a recommend() function with four parameters: age, by default 18; gnd, the gender, by default 0 (female); watched, the genre of the previously watched movie, by default 0 (Comedy); and test, by default False, which we can use during testing to pass the input values interactively. If we pass test as True, the function will ask for our input and also display the encoded values for each genre. Then the model predicts using the input values. We take the output (an encoded value), decode it and finally return it. So let's move onto a new cell, call our recommend() function and pass True for the test parameter

In [*]:
recommend(test=True)
Age:
You can see we are prompted for input by the recommend() function. So let's pass 18 as the age
In [*]:
recommend(test=True)
Age:18
Gender:
1
Pass the Gender as 1 (Male)

In [*]:
recommend(test=True)
Age:18
Gender:1
Comedy - 0
Romance - 5
Horror - 3
Mystery - 4
Drama - 1
Fantasy - 2
Watched:
You can see the function has displayed all the encoded values for each genre, so let's pass 0 (Comedy)

In [9]:
recommend(test=True)
Age:18
Gender:1
Comedy - 0
Romance - 5
Horror - 3
Mystery - 4
Drama - 1
Fantasy - 2
Watched:0

Out[9]:
'Mystery'
Now we get Mystery as the recommendation for the 18-year-old male who has previously watched a comedy movie. Because we have very little data, let's check the answer visually using the dataset
Out[2]:
   Age  Gender  Watched  Genre
0  19   Male    Comedy   Mystery
1  19   Female  Romance  Drama
2  19   Male    Romance  Drama
In the second run we printed the first three rows of our dataset, and by looking at them we can say that if an 18-year-old male (whose sample isn't present in the dataset) has previously watched a Comedy movie, he'll most likely enjoy a Mystery movie too, along with Comedy movies. So we have created our Movie Recommender model using a very small dataset; now it's up to you to test the model, or even take opinions from your relatives and make recommendations for them using the model!
ML APPLICATION 2 • Advertisement handling
Problem: You have to create a model to decide whether to show an ad to a user or not, and if yes, whether it should be the Car or the Insurance advertisement. The age of the user and the user class (a group provided by another model based on the user's past search results) are provided as input. You are provided with the following dataset: https://defmycode.cf/wp-content/uploads/2020/12/advertisement.csv
Once again we need to decide which method to use and clearly this is a problem of classification. We can use the KNN classifier algorithm for this task
So let's move onto jupyter notebook and import the algorithm, LabelEncoder, pandas library and the dataset In [1]:
import pandas
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder

dt = pandas.read_csv('advertisement.csv')
dt.head(3)
   Age  Search       Ad
0  18   Cars         Car
1  19   Automobiles  None
2  21   Automobiles  None
We again have very little data to work with. We have Age and Search as input and Ad as output. But first we need to preprocess the labels

In [2]:
Enc = LabelEncoder()
Enc.fit(['Cars','Automobiles','Health','Car',
         'Insurance','None'])
for col in ['Search','Ad']:
    dt[col] = Enc.transform(dt[col])
So we encoded the values in the Search and Ad columns using the Enc encoder. Now we can move onto the next step, data splitting, but as mentioned earlier we don't have enough data to split it into training and testing sets. So we need to create the testing set ourselves. Let's take a look at the whole dataset: we have 20 rows and 3 columns worth of data. We already know the input, i.e. Age and Search, and the output, i.e. Ad, where the input Search will be provided by another classifier that classifies the user into the Cars, Automobiles, Health and None classes on the basis of previous searches. We can build an understanding visually from the data: a person of age 18-28 should only be shown the Car advertisement when the person is in the Cars class, and the Insurance advertisement when the person is in the Health class. Similarly, a person of age 30 or more should be shown the Car advertisement if the person is in the Cars or Automobiles class, and so on
    Age  Search       Ad
0   18   Cars         Car
1   19   Automobiles  None
2   21   Automobiles  None
3   22   Cars         Car
4   23   None         None
5   26   None         None
6   27   Health       Insurance
7   28   Cars         Car
8   18   Health       Insurance
9   19   None         None
10  20   None         None
11  22   Automobiles  None
12  26   None         Insurance
13  30   Health       Insurance
14  30   Automobiles  Car
15  29   Cars         Car
16  29   Health       Insurance
17  29   None         Insurance
18  32   Automobiles  Car
19  32   Health       Insurance
And a person aged 18-24 in the None class should be shown nothing, but if the person is 25 or older the Insurance ad should be shown. All these assumed conditions are called hypotheses. Using these hypotheses we can create a testing dataset with close-to-accurate outputs. So let's create it

In [3]:
import numpy

X = dt.drop(columns='Ad')
y = dt['Ad']

def en(val):
    v = Enc.transform([val])
    return v[0]

tst_X = numpy.array(
    [[21,en('None')],[21,en('Health')],
     [27,en('None')],[23,en('Automobiles')],
     [34,en('None')],[34,en('Automobiles')]])
tst_y = numpy.array(
    [[en('None')],[en('Insurance')],
     [en('Insurance')],[en('None')],
     [en('Insurance')],[en('Car')]])
First of all we imported the numpy package to create our test data. Then we divided our dataset into training input and training output. To create the testing set we use the en() function, which takes a Search or Ad label and returns its encoded value. Now we can create some testing data based on our hypotheses; for example, a 21-year-old person of class None ([21,en('None')]) should be shown no advertisements ([en('None')]). As mentioned earlier, the testing set is built upon hypotheses. They may be correct or wrong; we created them for the purpose of testing our model. We can only use this in situations like these, where the data is compressed into a small dataset. We can visualize the testing set using a pandas data frame
In [4]:
def dec(val):
    v = Enc.inverse_transform(val)
    return v

tst = pandas.DataFrame({'Age':tst_X[:,0],
                        'Search':dec(tst_X[:,1]),
                        'Ad':dec(tst_y[:,0])})
tst
   Age  Search       Ad
0  21   None         None
1  21   Health       Insurance
2  27   None         Insurance
3  23   Automobiles  None
4  34   None         Insurance
5  34   Automobiles  Car
We defined the dec() function to minimize our code. Note that the ages in the testing set are not present in the actual dataset. Now we can move onto creating our KNN classifier and training it

In [5]:
KNN = KNeighborsClassifier()
KNN.fit(X,y)
Out[5]:
KNeighborsClassifier()
Now we can pass the test input to the KNN classifier and print the accuracy

In [6]:

from sklearn.metrics import accuracy_score

pred_y = KNN.predict(tst_X)
accuracy_score(tst_y, pred_y)

Out[6]:
0.8333333333333334
So our model has an accuracy of approximately 83%, and given that the testing set has 6 samples, our model has predicted correctly for 5 inputs and wrongly for only one
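The 83% figure is just 5 correct out of 6; a quick standalone check with toy labels:

```python
from sklearn.metrics import accuracy_score

# 6 test samples, 5 predicted correctly -> 5/6 = 0.8333...
actual = [0, 1, 2, 0, 1, 2]
pred = [0, 1, 2, 0, 1, 0]  # only the last prediction is wrong
acc = accuracy_score(actual, pred)
print(acc)
```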
But remember that the testing set is based upon our hypotheses, so maybe what the model predicted is actually right. You may have wondered about this by now: the testing set was based upon the hypotheses we created, i.e. we analyzed the data, found connections in the features & labels, and created the testing set; and the model is thinking the same as us for 5 of the inputs. The hypotheses we built are the same patterns and relations used by the model to predict. We did that just by having a thorough look at the data, which is most likely not to be wrong, but the model does everything in milliseconds. Imagine there were tens of thousands of samples like these; could you have done the same there? I hope your understanding of the 'machine' and the 'learning' in machine learning is clearer now
ML APPLICATION 3
• Checking wine quality
Problem: In a wine factory, you are asked to rate the quality of the production on a scale of 1 to 5, with different chemical properties passed as input for the produced batch, and then tell whether the batch is good or not. A good batch rates more than half the scale (2.5). You have the following sample dataset of some 1500 samples

Batch data: https://defmycode.cf/wp-content/uploads/2020/12/wine_batch.csv
Sample data: https://defmycode.cf/wp-content/uploads/2020/12/wine
We can use machine learning models to solve the problem, but the question is which algorithm to choose. If you are thinking of using a classifier algorithm because we need to rate the wine, then of course you're wrong: the rating is continuous on a scale of 1 to 5, where it can be 3, 3.5 or even 3.45, so for this problem we are going to use linear regression. So let's load our sample dataset and preview it with the describe() function, along with the other necessities

In [1]:
import pandas
import numpy
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

dt = pandas.read_csv('wine_sample.csv')
(dt.describe()).round(1)
Out[1]:
       fixed acidity  volatile acidity  citric acid  residual sugar  chlorides
count  1499.0         1499.0            1499.0       1499.0          1499.0
mean   8.4            0.5               0.3          2.5             0.1
std    1.7            0.2               0.2          1.4             0.0
min    4.6            0.1               0.0          0.9             0.0
25%    7.2            0.4               0.1          1.9             0.1
50%    8.0            0.5               0.3          2.2             0.1
75%    9.3            0.6               0.4          2.6             0.1
max    15.9           1.6               1.0          15.5            0.6

       free sulfur dioxide  total sulfur dioxide  density  pH      sulphates  alcohol  quality
count  1499.0               1499.0                1499.0   1499.0  1499.0     1499.0   1499.0
mean   15.6                 46.8                  1.0      3.3     0.7        10.4     5.6
std    10.5                 33.3                  0.0      0.2     0.2        1.1      0.8
min    1.0                  6.0                   1.0      2.7     0.3        8.4      3.0
25%    7.0                  22.0                  1.0      3.2     0.6        9.5      5.0
50%    13.0                 38.0                  1.0      3.3     0.6        10.1     6.0
75%    21.0                 63.0                  1.0      3.4     0.7        11.1     6.0
max    72.0                 289.0                 1.0      4.0     2.0        14.9     8.0
In this dataset we have twelve columns with about 1500 samples. The first eleven columns are different chemical properties of the wine, i.e. the input, and quality is the rating, i.e. the output. The minimum rating is 3 and the maximum is 8, but we need to rate the quality of the wine on a scale of 1 to 5. So we need to rescale the quality feature to the 1-5 range, and we will do that using the MinMaxScaler

In [2]:
from sklearn.preprocessing import MinMaxScaler

Sclr = MinMaxScaler(feature_range=(1, 5))
qal = numpy.array(dt['quality'])
dt['quality'] = Sclr.fit_transform(qal.reshape(-1,1))
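What the scaler does to the 3-8 quality range can be checked in isolation; this standalone sketch uses the six possible ratings, not the real column:

```python
import numpy
from sklearn.preprocessing import MinMaxScaler

# Ratings originally on the 3-8 scale, as in the wine data
qal = numpy.array([3, 4, 5, 6, 7, 8])
Sclr = MinMaxScaler(feature_range=(1, 5))

# The old minimum maps to 1 and the old maximum maps to 5
scaled = Sclr.fit_transform(qal.reshape(-1, 1)).ravel()
print(scaled)  # approximately [1. 1.8 2.6 3.4 4.2 5.]
```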
We imported the MinMaxScaler and created our Sclr object of the class. We passed the target scale, i.e. 1-5, in the feature_range parameter. Then we created a numpy array of the quality feature and scaled the data using the fit_transform() function. Note that we passed the array reshaped with reshape(-1,1), which converts the 1-D array [5,6,7,...] into the 2-D array [[5],[6],[7],...]. Now we can split the data, create our linear regressor and train it

In [3]:
X = dt.drop(columns='quality')
y = dt['quality']
trnX,tstX,trnY,tstY = train_test_split(X,y,test_size=0.1)
Reg = LinearRegression()
Reg.fit(trnX,trnY)
Out[3]:
LinearRegression()
Before checking the quality of the given batch we need to test our model and find some metrics. So let's use the testing sets and compare the model's predictions

In [4]:
predY = Reg.predict(tstX)
mae = metrics.mean_absolute_error(tstY,predY)
err = metrics.max_error(tstY,predY)
cmp = pandas.DataFrame({'Predicted':predY,
                        'Actual':tstY.values})
print('MAE:',mae,'\n','Max RE:',err)
cmp.plot(figsize=(7.5,6))
MAE: 0.43020099299063136
Max RE: 1.5037184643992099
Our model has an MAE (mean absolute error) of approximately 0.43, i.e. the average absolute error, with the maximum residual error (Max Error) at about 1.5. We have a lot of values, so let's plot a graph to compare them
Out[4]:
<matplotlib.axes._subplots.AxesSubplot at 0x23b10ef1160>
Looking at the graph we can tell how our model is performing. Observe that our model didn't rate any input a 5, whereas the actual values reach 5 only 3 times, which explains a lot: the distribution of higher values is low, therefore the prediction of higher ratings is also low. Still, our model is fine, so let's import the batch dataset and pass it to the model

In [5]:
batch = pandas.read_csv('wine_batch.csv')
batch_pred = Reg.predict(batch.values)
batch_pred.mean()
Out[5]:
3.135502047867703
We imported the csv file and passed its values for prediction. On average the rating is 3.14, and if we account for the MAE (0.43) the rating could also be 2.71 or 3.57. But in all of these cases the average rating of the batch is higher than 2.5, so the batch is fine!
ML APPLICATION 4
• Match play decision