Machine Learning Made Easy Using Python 9798584267551



English Pages [234] Year 2020


Table of contents :
Preface
Table of Contents
Chapter 1
Chapter 2
Chapter 3
Chapter 4
Chapter 5
Chapter 6
Chapter 7
Chapter 8
Chapter 9
Chapter 10
Chapter 11
Chapter 12
Chapter 13
Chapter 14
Chapter 15
Chapter 16
Chapter 17
Chapter 18
Chapter 19
Chapter 20
ML Application 1
ML Application 2
ML Application 3
ML Application 4
ML Application 5
Resources

PYTHON MACHINE LEARNING Machine learning is considered one of the must-have skills for the coming years. As tasks multiply, programming each of them by hand has become time consuming; machine learning lets machines learn on their own from the data we feed them and produce the same results. I assume that you have prior knowledge of Python programming and data science; if you don't, you can check the books on the next page. A complete guidebook for anyone who wants to master machine learning with Python.

Rahul Mula

Machine Learning with Python by Rahul Mula © 2020 Machine Learning with Python All rights reserved. No portion of this book may be reproduced in any form without permission from the copyright holder, except as permitted by U.S. copyright law. Cover by Rahul Mula. All the programs written in this book are tested and verified by the author. Cover Template from freepik.com ISBN : 979-8-58-426755-1


MAKE SURE TO CHECK THEM OUT

Python For Beginner

A beginners guide to programming with python

Data Science with Python

Learn how to perform tasks like data processing, cleansing, analysis and visualization

Why should you learn machine learning, and what are its uses? Those are the questions that may come to your mind. The answer is simple. Suppose you are given data from an online store about its products and asked to recommend products. The data has product name, category, quantity and rate columns, with several hundred rows of products. If you want to perform some analysis, such as finding the product purchased most on a given day, doing it manually will take a lot of time. To ease such tasks we use data analysis, i.e. we run a program that performs the analysis. The computer runs the program and we get the output in just a few seconds. Then we classify each user and suggest products from their preferred categories on the basis of their previous search results. So, how do we do that? Well, we need to learn Data Science and Machine Learning to perform those tasks. Businesses & organizations are trying to deal with it by building intelligent systems using the concepts and methodologies from Data Science, Data Mining and Machine Learning. Among them, machine learning is the most exciting field of computer science. It would not be wrong to call machine learning the application and science of algorithms that make sense of data.

This book is prepared especially for beginners (at Data Science and Machine Learning), but you should be familiar with programming in Python. We will work with packages and modules like NumPy, SciPy, Pandas, Matplotlib, Scikit Learn, etc. to perform analysis and other tasks. I have kept this book open to the basic concepts of data science to help beginners understand everything, but it covers only the data science concepts that are prerequisites for machine learning. As the name suggests, this book is not for you if you're looking for data science; check the other books page for that. I have also included advanced topics so that you are not limited to the basics. Machine learning, algorithms, data science, etc. may seem tough and boring, but as you handle more and more data, you'll play with it!

(contents)

03 PANDAS
• Features of Pandas Library • Series • Data Frames

06 MATPLOTLIB
• Features of matplotlib • Data visualization • PyPlot in matplotlib

SCIKIT LEARN
• Features of Scikit-learn library • How to work with data? • Why use Python?

TYPES OF MACHINE LEARNING
• Supervised learning • Unsupervised learning • Deep learning

SCIKIT LEARN ALGORITHMS
• Regression algorithm • Classification algorithm • Clustering algorithm

IMPORTING DATA
• Importing CSV data • Importing JSON data • Importing Excel data

DATA OPERATIONS
• NumPy operations • Pandas operations • Cleaning data

MATHEMATICS FOR MACHINE LEARNING
• Data instances • Statistics • Probability

DATA ANALYSIS & PROCESSING
• Data analytics • Correlations between attributes • Skewness of the data

DATA VISUALIZATION
• Plotting data • Univariate plots • Multivariate plots

CLASSIFICATION
• Decision tree • Linear regression • Naive Bayes

20 PERFORMANCE & METRICS
• Calculating the model • Improving the model • Saving and loading models

01 MACHINE LEARNING INTRODUCTION
• What is Machine Learning? • Use of Machine Learning • How Machines Learn?

What is Machine Learning?

Data is what you need to do ANALYTICS; information is what you need to do BUSINESS. Commonly referred to as the "oil of the 21st century", our digital data carries the most importance in the field. It has incalculable benefits in business, research and our everyday lives. Machine Learning is the field of computer science where machines provide meaning to data like we humans do. Machine Learning is a type of artificial intelligence which finds patterns in raw data through various algorithms and makes predictions like humans. Machine Learning also means machines learn on their own. To better understand it, think of a newborn child as a machine learning model. The parents cannot teach the child everything, so they send the child to school; in this analogy, the school is you working with the machine learning model. The school has textbooks, tests, etc. to help you learn on your own which


Uses of Machine Learning

Organizations are investing heavily in technologies like Artificial Intelligence, Machine Learning and Deep Learning to get the key information from data, perform real-world tasks and solve problems. We can call these data-driven decisions taken by machines, particularly to automate a process. Such data-driven decisions can be used, instead of programming logic, for problems that cannot be programmed by hand. The fact is that we can't do without human intelligence, but we also need to solve real-world problems efficiently at a huge scale. That is why the need for machine learning arises. The following are some of its real-world applications:
• Forecasting the weather a day ahead by finding patterns in the weather data of previous days
• Predicting the future prices of stocks in the stock market
• Suggesting a product to a customer in an online store according to the user's previous search terms


How do machines learn?

So what magic happens that lets machines learn like us and perform tasks? Let's understand that with an example. Say you are a new computer dealer with only basic experience. You ask another dealer for information and note the points that matter, like processor cores, RAM and GPU. The dealer tells you these and you learn in return. Then you are provided with the following data about 8 GB RAM of different brands:

[Figure: scatter plot of RAM price against RAM frequency]

By observing the data we can tell that the price increases as the frequency increases. You understand a simple logic behind the data. Then, if you get RAM with a frequency that appears in the data, you can tell its price; for example, RAM with a frequency of 1666 MHz has a price of 60.


But what if you get a frequency that you have no record of, like 2600 MHz? Then you have to learn how to decide the price. We start by finding a way to calculate it from the given data. We assume that there is a linear relationship between the two, and we define the relationship as a straight line:

[Figure: the RAM price against frequency plot with a straight line drawn through the points]

Now we can use the line as a reference and predict values. So, for 2600 MHz the cost will be about 77. So how do we draw the line? We follow the formula cost = a + b * MHz. But what are a and b? They are the parameters of the straight line (a is the intercept and b is the slope), which you don't need to sweat about.
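To make the formula cost = a + b * MHz concrete, here is a minimal sketch of how a and b could be estimated by least squares. The data points are made up for illustration (the values in the book's chart are not recoverable) and were chosen only to agree roughly with the two examples in the text; fit_line is a hypothetical helper name, not something from the book.

```python
# Illustrative (made-up) RAM data: frequency in MHz and price.
# These points are assumptions, picked to be roughly consistent with
# the text's examples (1666 MHz -> 60, 2600 MHz -> about 77).
freqs = [1666, 2133, 2400, 3000, 3200]
prices = [60, 68, 73, 84, 88]


def fit_line(xs, ys):
    """Return (a, b) of the least-squares line y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope b: covariance of x and y divided by the variance of x.
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    # Intercept a: the line passes through the point of means.
    a = mean_y - b * mean_x
    return a, b


a, b = fit_line(freqs, prices)
predicted = a + b * 2600  # price estimate for an unseen frequency
print(f"cost = {a:.2f} + {b:.4f} * MHz")
print(f"predicted price at 2600 MHz: {predicted:.1f}")
```

With these assumed points the prediction for 2600 MHz lands in the high 70s, in line with the value read off the line in the text.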


Likewise in machine learning, the machine, i.e. the computer, learns the patterns or relations in the data through algorithms and predicts a value when a new input is given. So will there be no errors? There definitely will be: some predictions will differ from the actual answer. We also make many mistakes, but we learn from them, or change our tutor if the results stay poor. Machine learning models too learn from their mistakes, and we change the algorithm when the results do not improve.

02 SETTING-UP ENVIRONMENT
• Installing Anaconda • Jupyter notebook • Working with Jupyter notebook

Installing Anaconda

Head to anaconda.com/products/individual to download the latest version of Anaconda.

Anaconda Installers
Windows: 64-Bit Graphical Installer (466 MB); 32-Bit Graphical Installer (397 MB)
MacOS: 64-Bit Graphical Installer (462 MB); 64-Bit Command Line Installer (454 MB)
Linux: 64-Bit (x86) Installer (550 MB); 64-Bit (Power8 and Power9) Installer ( MB)

You can download the Anaconda installer for your system, whether it is Windows, Mac or Linux. After downloading it, just run the installer to install Anaconda. Once it is installed, search for Anaconda Prompt (anaconda3) and open it.


The Anaconda Prompt is a command prompt from which we can run programs or perform other operations by typing commands.

Anaconda Prompt (anaconda3)
(base) C:\Users\Rahul>

Jupyter Notebook

The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. We will use it to perform our data processing, analytics, visualization, etc. on the go. To open Jupyter Notebook, type jupyter notebook in the Anaconda Prompt and press Enter.

(base) C:\Users\Rahul>jupyter notebook


(base) C:\Users\Rahul>jupyter notebook
[I NotebookApp] JupyterLab extension loaded from C:\Users\Rahul\anaconda3\lib\site-packages\jupyterlab
[I NotebookApp] JupyterLab application directory is C:\Users\Rahul\anaconda3\share\jupyter\lab
[I NotebookApp] Serving notebooks from local directory: C:\Users\Rahul
[I NotebookApp] The Jupyter Notebook is running at:
[I NotebookApp] http://localhost:8888/?token=bda34ad58a2f2015a03f835d458f95010541adf58866ccf7
[I NotebookApp]  or http://127.0.0.1:8888/?token=bda34ad58a2f2015a03f835d458f95010541adf58866ccf7
[I NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation).
[C NotebookApp]

    To access the notebook, open this file in a browser:
        file:///C:/Users/Rahul/AppData/Roaming/jupyter/runtime/nbserver-12332-open.html
    Or copy and paste one of these URLs:
        http://localhost:8888/?token=bda34ad58a2f2015a03f835d458f95010541adf58866ccf7
     or http://127.0.0.1:8888/?token=bda34ad58a2f2015a03f835d458f95010541adf58866ccf7

Anaconda will redirect you to your browser [it may ask which browser to use if you have more than one], and a new tab will appear with your Jupyter Notebook hosted. You can host your Python files here and also run the code on the fly.


In Jupyter Notebook with Anaconda, we don't need to install the modules, packages or libraries used in this book separately; they are already present. The best thing is that you can also code online without installing any IDE or the Python interpreter, which makes it the best choice for data scientists.


Working with Jupyter Notebook

To start coding, click on New and select Python 3 to open a new Python file.


This is the place where we will write our code [in the cell] and run it.


If you cannot create a new file or encounter any error, you can head directly to jupyter.org/try and choose Python.


We can rename our file by clicking its name [Untitled].


We have only one code cell; in this cell we will write our code.


There are three types of cells: code cells, markdown cells and raw cells. We can use markdown cells to display headings or titles.


# Jupyter Notebook

Now run the cell by clicking the Run button in the toolbar.


In code cells, we can write Python code and execute it instantly.


In [ ]: 123*525


In [1]: 123*525
Out[1]: 64575

To insert a new cell below the selected cell, press b on your keyboard or click the + icon.


You can select a cell [blue border] by clicking outside its text field, or edit it [green border] by clicking inside the text field.


Markdown cells give us more ways to display text gracefully. We can add headings, sub-headings and lower-level headings by typing # one, two or three times respectively, followed by a space and then the text.

# Jupyter Notebook
## IPython
### Data

We can create ordered and bulleted lists.

Ordered list:
1. Data Science
    1. Python
    2. Jupyter Notebook
    3. Libraries

Bulleted list:
- Data Science
    * Python
    * Jupyter Notebook
    * Libraries

To create an ordered list, use 1. for the first list item, then indent the sub-list items with a tab and number them in order. [The text should follow a space after the number.] To create a bulleted list, start each item with - or * and indent sub-list items in the same manner as above; the bullet shape (round or square) follows the nesting level.


We can also add links using [] and (). Write the display text in [] and put the link in (); you can also add hover text inside the ( ) using " " quotes.

[Jupyter Notebook for Python](https://jupyter.org/try "Try it!")

We can also wrap text in ** ** or __ __ to render bold text, and in * * or _ _ to render italicized text.

We can also insert images by going to Edit > Insert Image and browsing to the image.


Create tables using | and strictly following the example below:

|Product|Price|Quantity|
|---|---|---|
|Biscuits|5|2|
|Milk|7|5L|

This renders as:

Product   Price  Quantity
Biscuits  5      2
Milk      7      5L


Here's a complete list of shortcuts for various operations with cells.

Operation                      Shortcut
change cell to code            y
change cell to markdown        m
change cell to raw             r
close the pager                Esc
restart kernel                 0 + 0
copy selected cell             c
cut selected cell              x
delete selected cell           d + d
enter edit mode                Enter
extend selection below         Shift + j
extend selection above         Shift + k
find and replace               f
ignore                         Shift
insert cell above              a
insert cell below              b
interrupt the kernel           i + i
merge cells                    Shift + m
paste cells above              Shift + v
paste cells below              v
run cell and insert below      Alt + Enter
run cell and select below      Shift + Enter
run selected cells             Ctrl + Enter
save notebook                  Ctrl + s
scroll notebook up             Shift + Space
scroll notebook down           Space
select all                     Ctrl + a
show keyboard shortcuts        h
toggle all line numbers        Shift + l
toggle cell output             o
toggle cell scrolling          Shift + o
toggle line numbers            l
undo cell deletion             z

03 PANDAS LIBRARY
• Features of Pandas library • Series • Dataframes

Pandas

Data science requires high-performance data manipulation and data analysis, which we can achieve with Pandas data structures. Python with pandas is in use in a variety of academic and commercial domains, including finance, economics, statistics, advertising, web analytics, and more. Using Pandas, we can accomplish five typical steps in the processing and analysis of data, regardless of its origin: load, organize, manipulate, model, and analyse.

Key features of Pandas library

We can achieve a lot with the Pandas library using features like:
• Fast and efficient DataFrame object with default and customized indexing.
• Tools for loading data into in-memory data objects from different file formats.
• Data alignment and integrated handling of missing data.
• Label-based slicing, indexing and subsetting of large data sets.
• Columns from a data structure can be deleted or inserted.
• Group by data for aggregation and transformations.

Series

Pandas handles data with its data structures, known as series, data frames and panel (the panel structure has been removed from recent pandas releases). A Series is a one-dimensional, array-like structure with homogeneous data. For example, the following series is a collection of integers:

10 17 23 55 67 71 92

Because a series is a homogeneous data structure, it can contain only one type of data [here integer]. So, we conclude that a Pandas Series:
• Is a homogeneous data structure
• Has a size that cannot be mutated
• Has values that can be mutated
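As a small sketch of the points above, the series of integers from the example can be built with pd.Series. The variable name s is an arbitrary choice for illustration; the calls shown (dtype, size, iloc) are standard pandas attributes.

```python
import pandas as pd

# The integers from the example above, stored as a one-dimensional Series.
s = pd.Series([10, 17, 23, 55, 67, 71, 92])

print(s.dtype)   # a single dtype shared by every value (homogeneous data)
print(s.size)    # number of elements in the series

# Values can be changed in place; the length stays the same.
s.iloc[0] = 11
print(s.iloc[0])
```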

Data Frames

A DataFrame is a two-dimensional array with heterogeneous data.

Day        Sales
Monday     33
Tuesday    37
Wednesday  14
Thursday   29

The data shows the sales of a certain product for 4 days. You can think of a data frame as a container for 2 or more series. So, we conclude that a Pandas data frame:
• Can contain heterogeneous data
• Has a mutable size
• Has mutable data
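The Day and Sales table can be written as a DataFrame in a few lines. This is only a sketch: the "Returns" column and its values are hypothetical and appear here purely to show that columns can be added.

```python
import pandas as pd

# The Day/Sales table from the text: one text column, one integer column.
sales = pd.DataFrame({
    "Day": ["Monday", "Tuesday", "Wednesday", "Thursday"],
    "Sales": [33, 37, 14, 29],
})
print(sales.dtypes)            # each column keeps its own type (heterogeneous)

# Each column is itself a Series.
print(type(sales["Sales"]))

# Size is mutable: add a column (hypothetical values, for illustration only).
sales["Returns"] = [1, 0, 2, 1]

# Data is mutable: change a single cell by row label and column name.
sales.loc[2, "Sales"] = 15
print(sales)
```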

We will use Pandas series and data frames a lot in future lessons, so make sure to go through this lesson again and get a grasp of them.

Key Points
• The Pandas library is a high-performance data manipulation and data analysis tool.
• Pandas data structures include series and data frames.
• A Series is a 1-dimensional array of homogeneous data, whose size is immutable but whose values are mutable.
• A Data Frame is a 2-dimensional array of heterogeneous data made of 2 or more series, whose size and data are mutable.

04 NUMPY PACKAGE
• Features of NumPy • ndarray Objects • List vs. ndarrays

NumPy

NumPy is a Python package whose name stands for 'Numerical Python'. It is a library consisting of multidimensional array objects and a collection of routines for processing arrays.

[Figure: 1D, 2D and 3D arrays with axis 0 and axis 1 marked; the 1D array (7, 2, 9, 10) has shape (4,) and the 2D array (5.2, 3.0, 4.5 / 9.1, 0.1, 0.3) has shape (2, 3)]

Key features of NumPy

NumPy is a powerful package with many features, such as:
• Mathematical and logical operations on arrays.
• Fourier transforms and routines for shape manipulation.
• Operations related to linear algebra. NumPy has built-in functions for linear algebra and random number generation.
• NumPy ndarrays are much faster than Python's built-in lists and consume less memory.
• Most of the parts that require fast computation are written in C or C++.
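The features listed above can be tried in a few lines. This is a rough tour rather than a complete reference; the array values and variable names are arbitrary choices for illustration.

```python
import numpy as np

a = np.array([1, 2, 3, 4])

# Mathematical and logical operations work element by element.
shifted = a + 10
in_range = (a > 1) & (a < 4)

# Shape manipulation: view the same four values as a 2 x 2 matrix.
matrix = a.reshape(2, 2)

# Linear algebra: determinant of [[1, 2], [3, 4]].
det = np.linalg.det(matrix)

# Fourier transform of the original values.
spectrum = np.fft.fft(a)

# Random number generation with a fixed seed for repeatability.
rng = np.random.default_rng(seed=0)
samples = rng.integers(0, 10, size=3)

print(shifted, in_range, det, spectrum, samples, sep="\n")
```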


ndarray objects

NumPy aims to provide an array object that is up to 50x faster than traditional Python lists. The array object in NumPy is called ndarray, and it provides a lot of supporting functions that make working with it very easy. Arrays are very frequently used in data science, where speed and resources are very important. In NumPy, we can create 0-D, 1-D, 2-D and 3-D ndarrays.

0-D: (33)
1-D: ([11, 27, 18])
2-D: ([[3, 5, 6], [5, 7, 11]])
3-D: ([[[5, 8, 19], [6, 9, 10], [4, 1, 11]]])

In brief, ndarrays or n-dimensional arrays:
• Describe a collection of items of the same type.
• Allow items in the collection to be accessed using a zero-based index.
• Use the same size of memory block for every item.
• Carry a data-type object (called dtype) that describes each element. Any item extracted from an ndarray (by indexing) is represented by a Python object of one of the array scalar types.
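A short sketch of the four cases, using the values from the examples above. The nesting of the brackets is what decides the number of dimensions; the bracket depth used for the 2-D and 3-D arrays here is an assumption, since the printed examples lost their outer brackets.

```python
import numpy as np

arrays = {
    "0-D": np.array(33),
    "1-D": np.array([11, 27, 18]),
    "2-D": np.array([[3, 5, 6], [5, 7, 11]]),
    # Assumed nesting: one block of three rows, giving three dimensions.
    "3-D": np.array([[[5, 8, 19], [6, 9, 10], [4, 1, 11]]]),
}

for name, arr in arrays.items():
    print(name, "ndim =", arr.ndim, "shape =", arr.shape, "dtype =", arr.dtype)

two_d = arrays["2-D"]
item = two_d[1, 2]      # zero-based index: second row, third column
print(item)
print(type(item))       # a NumPy array scalar type, not a plain Python int
print(two_d.itemsize)   # bytes used by every element (the same for all)
```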


Lists vs. ndarray

In Python we have lists that serve the purpose of arrays, but they are slow to process. NumPy's ndarray is designed to be much faster.

Lists
• A list is an array of heterogeneous objects
• List elements are stored in different places in memory, which makes processing slow
• Lists are not optimized to work with the latest CPUs
• A 1-dimensional list: ['A', 56, 67.05]

ndarrays
• An ndarray is an array of homogeneous objects
• ndarray elements are stored in one continuous place in memory, which makes processing faster
• ndarrays are optimized to work with the latest CPUs
• A 1-dimensional ndarray: ([12, 17, 25])

[Figure: List arrays memory allocation (list elements stored at scattered memory locations)]

[Figure: ndarrays memory allocation (a PyObject_Head with data, dimensions and strides fields, and the values stored in one contiguous block)]

You can now see why built-in lists are slower than ndarrays. To process data much faster we will use NumPy in future lessons, so make sure to get a hold of it.
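The differences between lists and ndarrays can be observed directly. The sketch below reuses the two example sequences from the comparison above; the variable names are arbitrary.

```python
import numpy as np

# A list keeps each element as its own Python object, so types can be mixed.
mixed_list = ['A', 56, 67.05]
print([type(x).__name__ for x in mixed_list])

# Converting the same values to an ndarray forces one common dtype
# (here every element becomes a string).
as_array = np.array(mixed_list)
print(as_array.dtype)

# A numeric ndarray sits in one contiguous block of memory...
numbers = np.array([12, 17, 25])
print(numbers.flags['C_CONTIGUOUS'])

# ...which lets NumPy apply an operation to all elements at once,
# where a list needs an explicit Python-level loop.
doubled_array = numbers * 2
doubled_list = [n * 2 for n in [12, 17, 25]]
print(doubled_array, doubled_list)
```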

Key Points
• NumPy stands for Numerical Python; it is a Python package used for working with arrays.
• It also has functions for working in the domains of linear algebra, Fourier transforms, and matrices.
• ndarrays or n-dimensional arrays are homogeneous arrays, which are optimized for fast processing.
• ndarrays also provide many functions that make them suitable for working with data.

05 SCIPY PACKAGE
• Features of SciPy • Data Structures • SciPy Sub-Packages

SciPy

The SciPy library of Python is built to work with NumPy arrays and provides many user-friendly and efficient numerical routines, such as routines for numerical integration and optimization. Together, they run on all popular operating systems and are quick to install.

In [1]:
#Import packages
from scipy import integrate
import numpy as np

def my_integrator(a, b, c):
    # Integrate a*exp(b*x) + c over the interval [0, 100]
    my_fun = lambda x: a*np.exp(b*x) + c
    y, err = integrate.quad(my_fun, 0, 100)
    print('ans: %1.4e, error: %1.4e' % (y, err))
    return (y, err)

#Call function
my_integrator(5, -10, 3)

ans: 3.0050e+02, error: 4.5750e-10
Out[1]: (300.5, 4.574965520082099e-10)

Key features of SciPy

SciPy combined with NumPy makes a powerful tool for data processing, with features like:
• The SciPy package contains various toolboxes dedicated to common issues in scientific computing. Its different submodules correspond to different applications, such as interpolation, integration, optimization, image processing, statistics, special functions, etc.
• SciPy is the core package for scientific routines in Python; it is meant to operate efficiently on NumPy arrays, so that NumPy and SciPy work hand in hand.
• SciPy is organized into sub-packages covering different scientific computing domains, which makes it more efficient.


Data structures

The basic data structure used by SciPy is a multidimensional array provided by the NumPy module. NumPy provides some functions for linear algebra, Fourier transforms and random number generation, but not with the generality of the equivalent functions in SciPy. Beyond these, SciPy offers physical and mathematical constants, Fourier transforms, interpolation, data input and output, sparse matrices, etc.

[Figure: a dense matrix shown beside a sparse matrix. Caption: Use of Sparse matrix]
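To illustrate the difference between dense and sparse storage, here is a small sketch using scipy.sparse. The matrix values are made up for illustration and are not the ones from the figure; csr_matrix is one of several sparse formats SciPy offers.

```python
import numpy as np
from scipy import sparse

# A mostly-zero matrix stored densely: every entry takes up memory.
dense = np.array([
    [0, 0, 3, 0],
    [0, 0, 0, 0],
    [7, 0, 0, 0],
    [0, 2, 0, 0],
])

# The same matrix in compressed sparse row (CSR) form:
# only the non-zero entries and their positions are stored.
compact = sparse.csr_matrix(dense)

print(compact.nnz)         # number of stored (non-zero) entries
print(compact.toarray())   # convert back to a dense NumPy array
```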

SciPy sub-packages

As we already know, SciPy is organized into sub-packages covering different scientific computing domains, so we can import just the ones we need rather than the whole library. The following table lists the sub-packages of SciPy:


Sub-package         Description
scipy.constants     Mathematical constants
scipy.fftpack       Fourier transform
scipy.integrate     Integration routines
scipy.interpolate   Interpolation
scipy.io            Data input and output
scipy.linalg        Linear algebra routines
scipy.optimize      Optimization
scipy.signal        Signal processing
scipy.sparse        Sparse matrices
scipy.spatial       Spatial data structures
scipy.special       Special mathematics
scipy.stats         Statistics
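As a sketch of importing only the sub-packages that are needed, the example below pulls in scipy.constants and scipy.linalg. The small system of equations is a made-up example.

```python
import numpy as np
from scipy import constants, linalg

# scipy.constants: ready-made mathematical and physical constants.
print(constants.pi)
print(constants.c)     # speed of light in metres per second

# scipy.linalg: solve the system  2x + y = 3,  x + 3y = 5.
coefficients = np.array([[2.0, 1.0], [1.0, 3.0]])
right_hand_side = np.array([3.0, 5.0])
solution = linalg.solve(coefficients, right_hand_side)
print(solution)
```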

Key Points
• The SciPy package is a toolbox used for common scientific computing problems.
• SciPy together with NumPy creates a dynamic tool for data processing.
• Along with NumPy functions, SciPy provides a lot of functions to perform different tasks with ndarrays.
• SciPy is divided into sub-packages dedicated to different tasks.

06 MATPLOTLIB LIBRARY
• Features of matplotlib • Data Visualization • PyPlot in Matplotlib

MatPlotLib

Matplotlib is a Python library used to create 2D graphs and plots from Python scripts. It has a module named pyplot which makes plotting easy by providing features to control line styles, font properties, axis formatting, etc.

[Figure: sample bar chart by day (Thur, Fri, Sat, Sun) and a sample scatter plot]

Key features of MatPlotLib

Matplotlib is the best choice for data visualization because of features like:
• It supports a very wide variety of graphs and plots, namely histograms, bar charts, power spectra, error charts, and many more.
• It is used along with NumPy to provide an environment that is an effective open source alternative to MATLAB.
• Using its PyPlot module, plotting simple graphs or any other charts is very easy.


Data Visualization

Data visualization is the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to see and understand trends, outliers, and patterns in data.

In the world of Big Data, data visualization tools and technologies are essential to analyze massive amounts of information and make data-driven decisions. Data visualization helps us view data in a graphical, more interesting way rather than as a big chunk of numbers in uniform lines. We will process, analyze and then visualize our data. If we don't visualize it, it loses a lot of the impact it would have in the form of bar graphs, pie charts, etc.


PyPlot in Matplotlib
matplotlib.pyplot is a collection of functions that make Matplotlib work like MATLAB. Each pyplot function makes some change to a figure: e.g., creates a figure, creates a plotting area in a figure, plots some lines in a plotting area, decorates the plot with labels, etc. To test it yourself, open Jupyter Notebook and start off by importing the matplotlib.pyplot module.

In [ ]:
import matplotlib.pyplot as mplt

To plot a simple graph, pass a list to the plot function and then call the show function to view the graph.

We have successfully plotted our graph with some values in a list. If we want, we can name the x and y axes using the xlabel and ylabel functions respectively.


In [2]:
import matplotlib.pyplot as mplt
mplt.plot([1, 3, 6, 9])
mplt.xlabel('X_Axis')
mplt.ylabel('Y_Axis')
mplt.show()

The graph has a solid blue line. We can change its color and line style by passing another argument to the plot function, such as 'ro': 'r' for red and 'o' for circle markers.

In [2]:
import matplotlib.pyplot as mplt
mplt.plot([1, 3, 6, 9], 'ro')
mplt.xlabel('X_Axis')
mplt.ylabel('Y_Axis')
mplt.show()

Scikit Learn or Sklearn
Scikit-learn (sklearn) is one of the most useful and robust libraries for machine learning in Python. It provides a selection of efficient tools for machine learning and statistical modeling, including classification, regression, clustering and dimensionality reduction, via a consistent interface in Python. The library, which is largely written in Python, is built upon NumPy, SciPy and Matplotlib.

Features of Scikit Learn
Scikit-learn focuses on modelling data. The following are the most popular groups of models provided by the library:
• Supervised learning algorithms, such as Linear Regression, Support Vector Machine (SVM), Decision Tree, etc.
• Unsupervised learning algorithms, ranging from clustering, factor analysis and PCA (Principal Component Analysis) to unsupervised neural networks.
• Cross validation, dimensionality reduction, ensemble methods, feature extraction and feature selection, which are used, respectively, to check the accuracy of supervised models, reduce the number of attributes in the data, combine the predictions of multiple supervised models, extract features from the data, and identify useful features in the data.

08 TYPES OF MACHINE LEARNING
• Supervised learning • Unsupervised learning • Deep learning • Reinforcement learning • Deep reinforcement learning

In the previous lessons we learned about the various libraries and packages used in machine learning. Now let's look at the types of machine learning.

Types of machine learning
The following are the different types of machine learning:
• Supervised learning
• Unsupervised learning
• Deep learning
• Reinforcement learning
• Deep reinforcement learning

Supervised learning
As its name suggests, in supervised learning we train a machine learning model with supervision. We feed the data, train the model and ask the model to make predictions. If the predictions are correct, we leave the model to work on its own from there; otherwise we help the model predict correctly until it learns to do so. It is the same as helping a child solve questions at first until they can solve them on their own.


Types of supervised learning
Regression and classification are two types of supervised machine learning. They can be understood as:
• Regression is the type of machine learning in which we feed the model data such as: 'A' (input, i.e. X) has a value of 65 (output, i.e. Y), 'B' has a value of 66, etc. Based on the given data, the model learns the relation between the input and the output (here 'A' and 65). Once the model is trained with sufficient data, we provide an input, say 'C', and let the model predict the output; you must already know the real output for that input. You compare the prediction with the real value to check whether it is correct. If the predictions are correct, we pass the model. If the predictions aren't correct, we keep training the model with more data until they are.


Regression in turn has different types, such as linear and logistic regression (the latter, despite its name, is normally used for classification), which we will learn in a separate lesson.
• Classification is the type of machine learning in which we feed data and the model classifies the data into different groups. Consider an example in which the data contains different types of shapes. We teach the model which shape is which, i.e. what the different groups in the data are, by providing the groups together with their features:

[Figure: example shapes grouped as circle, square, oval and rounded square]


Now the trained model can classify new data, having learned how the groups are formed. If a new shape is passed, the model classifies it according to what it has learned. As with regression, we keep feeding it data until it classifies the data correctly.

Classification also has different types, such as decision trees, Naive Bayes classification, support vector machines, etc. We will learn about them in the lesson dedicated to this topic.
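The regression workflow described earlier in this lesson (train on known input/output pairs, then check a prediction against a value you already know) can be sketched in plain Python. This is only an illustrative sketch under stated assumptions: the helper `fit_line`, the numeric encoding of the letters as positions 0 to 4, and the held-out answer 69 are not from the book.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / spread_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Assumed encoding: the letters 'A'..'D' become the positions 0..3.
train_x = [0, 1, 2, 3]
train_y = [65, 66, 67, 68]           # known outputs for 'A'..'D'

slope, intercept = fit_line(train_x, train_y)

# Predict the output for a new input ('E' at position 4) and compare it
# with the real value, which we must already know in order to check the model.
prediction = slope * 4 + intercept
known_answer = 69
is_correct = abs(prediction - known_answer) < 1e-9
print(prediction, is_correct)
```

If `is_correct` were false, the supervised workflow would continue by adding more training pairs and fitting again.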


Unsupervised learning
Unlike supervised learning, we don't teach the model or check the predictions it makes; instead we feed the data and ask for predictions directly. As a rule, the more data you feed the model, the more accurate the results will be. Unsupervised learning is used in artificial intelligence applications like face detection, object detection, etc.

Deep learning
Deep learning models are based on Artificial Neural Networks (ANNs), for example Convolutional Neural Networks (CNNs). Several architectures are used in deep learning, such as deep neural networks, deep belief networks, recurrent neural networks, and convolutional neural networks. These models are used to solve problems like:
• computer vision, image classification, etc.
• bioinformatics
• drug design
• games, etc.
Deep learning is a large subject in its own right, so we will not discuss it in much detail, but let's see what it is and how it solves problems. Deep learning requires a lot of data and computational power, but nowadays high performance computing is available to us. Let's consider an example where a deep learning model tells us whether an animal is a horse or not. The network is given a large number of horse photos as data, analyzes them and tries to extract patterns such as horns, color, saddle, eyes, etc.

The neural network comes to a conclusion about whether the animal is a horse or not, but how it reached that conclusion is unknown. The reasoning cannot be obtained from deep learning models, which is why they are also considered a black box.

Reinforcement learning
Reinforcement learning consists of models that don't require any input or output data up front; instead they learn to perform certain tasks, like solving puzzles or playing games. If the model performs a step correctly it is rewarded points; otherwise it is penalized. The more the model performs, the more it learns from its mistakes. The model creates data as it performs the task, rather than receiving data at the beginning. For example consider the following model:


Deep Reinforcement learning
As the name suggests, this is the combination of reinforcement learning and deep learning. Reinforcement algorithms are combined with deep learning to create powerful Deep Reinforcement Learning (DRL) models that are used in fields like robotics, video games, finance and healthcare. Many previously unsolvable problems have been solved by these models. DRL models are still new to us and there is a lot left to learn about them.

09 MATHEMATICS FOR MACHINE LEARNING
• Data instances • Statistics • Probability • Bayes Theorem

Although every mathematical calculation will be performed by the computer, you need to know the important formulae and mathematical notation even if you're not solving them yourself. In this chapter we will go through the important mathematical concepts required for machine learning.

Data instances
Data is what we need to perform every task, i.e. data is the base of everything. You need to know which types of data are required in which process. Let's consider the following as our data:

Day         Sales
Monday      33
Tuesday     37
Wednesday   14
Thursday    29

In the above table there are two columns, Day and Sales, and four rows. The data contains two kinds of things, features and labels: features are the numeric values (like 33) and labels are the descriptive values (like Monday). Here Day is the label and Sales is the feature.

Labels come in the following two types:
• Nominal: these data aren't ordered. They have no hierarchy, i.e. no upper or lower status. In the following data the labels have either a True or a False value, so they are considered nominal data.

Question   Answer
01         True
02         False
03         False
04         True

• Ordinal: these data are ordered. They have an upper or lower status. In the following data the label values have an order, Good > Average > Bad, so they are ordinal data.

Product ID   Rating
101          Average
102          Good
103          Average
104          Bad

Similarly, features also come in two different types:
• Discrete, or countable, values. These values can only take certain separate values. For example, in the following data the feature, number of children (NoC) in different families, is limited to whole numbers such as 1, 2 or 3, so it is called discrete data.

Family     NoC
Smith's    2
Matrin's   1
Cox's      3
Hyde's     2

• Continuous values. These values can take any value within a range. For example, in the following data the feature, the weight of different people, isn't limited to fixed steps: it could be 110 pounds, 110.20 pounds or even 110.21 pounds, so it is continuous data.

Person   Weight
Thon     110
Max      122
Mary     96
Alex     120

Data Collection
Data is collected from many sources. Let's say we want data about a whole country. We could survey the whole population, which is time consuming, or we can select a sample of the population and survey that instead. The sample can be selected randomly, or on the basis of particular characteristics or other features. Likewise, instead of feeding a whole data set to a model, we can take a sample from it to save time while still getting representative results.
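As a rough illustration of random sampling, the sketch below draws a sample from a made-up population and compares the two means. The population values, the seed and the sample size of 500 are assumptions chosen only for the example.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical population: 10,000 made-up measurements between 40 and 120.
population = [random.randint(40, 120) for _ in range(10_000)]

# Survey only a random sample of 500 members instead of the whole population.
sample = random.sample(population, k=500)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)
print(population_mean, sample_mean)
```

With a reasonably large random sample, the sample mean typically lands close to the population mean, which is why sampling can save time without distorting the result much.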


Statistics
Statistics is often thought of as data visualization with bar graphs and the like, but statistics also includes data collection, data analysis and the representation of data. As you may have learnt in school, we perform statistical analysis of data by, for example, finding the central tendency and visualizing the data in graphs. Both descriptive and inferential statistics are used in machine learning.

Descriptive statistics
In descriptive statistics we work with the whole data, i.e. the population, rather than a sample. Descriptive statistics includes the following:
• Central Tendency
The mean, median and mode of a data set are referred to as its central tendency. We can find each of them very easily. Let's consider the following data and find its central tendency:

Day         Sales
Monday      33
Tuesday     37
Wednesday   14
Thursday    29

To find the mean, or average, of the sales, we add all the values together and divide the sum by the total number of values:

\[ \bar{x} = \frac{\text{sum of all values}}{\text{total no. of values}} = \frac{33 + 37 + 14 + 29}{4} = 28.25 \]

The average (mean) of the sales is 28.25. As mentioned earlier, you don't need to perform the calculations yourself; the computer will do them. The pandas library even has ready-made methods for this, such as DataFrame.mean() and Series.mean(). You just need to know how the value is obtained, so that you understand what happens in each analysis.

To find the median, or middle value, sort the numbers. If the data has an odd number of values, the middle value is the median. For example, in

12 34 (56) 71 77

56 is the median. But if the number of values is even, as in our sales data (sorted: 14, 29, 33, 37), we take the sum of the middle pair and divide it by 2:

\[ \text{median} = \frac{\text{sum of middle pair}}{2} = \frac{29 + 33}{2} = 31 \]

Finally we have the mode, or the most frequently occurring value. It can be found by inspection, but our sales data has no repeating values, so we will consider the following example:

(12) 34 (12) 71 77 (12) 56 78

12 is the mode of the above data, as it occurs three times. A data set can contain several repeated values; the mode is the one repeated most often.
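The three measures of central tendency can be checked with Python's standard `statistics` module. This is a minimal sketch; the second list is an illustrative data set in which 12 occurs three times, as in the mode example.

```python
import statistics

sales = [33, 37, 14, 29]                  # Monday to Thursday sales

mean_sales = statistics.mean(sales)       # (33 + 37 + 14 + 29) / 4
median_sales = statistics.median(sales)   # even count: average of the middle pair 29 and 33

# Illustrative data in which the value 12 is repeated three times.
repeated = [12, 34, 12, 71, 77, 12, 56, 78]
mode_value = statistics.mode(repeated)    # most frequently occurring value

print(mean_sales, median_sales, mode_value)
```

In pandas the same values are available through the `mean()`, `median()` and `mode()` methods of a Series.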


• Variability or Spread
Range, interquartile range, variance and standard deviation are measures of variability, or spread.

Range is the difference between the maximum and minimum values in the data. For example, the range of the sales data is

\[ \text{range} = \max - \min = 37 - 14 = 23 \]

Interquartile range is similar to the range, but it measures the spread of the middle half of the data. Let's consider the following data:

12 27 33 35 35 42 45 47 51 53 54

We divide the data into quarters, using three of the numbers as separators:

12 27 (33) 35 35 (42) 45 47 (51) 53 54

Then we subtract the first separator from the third separator to get the interquartile range:

\[ \text{interquartile range} = \text{3rd separator} - \text{1st separator} = 51 - 33 = 18 \]
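The range and interquartile range above can be reproduced with a short sketch. The helper `quartile_separators` is a hypothetical name, and it uses the "median of each half" convention that matches the separators 33, 42 and 51; other conventions (for example the default interpolation used by percentile functions in numerical libraries) can give slightly different quartiles.

```python
import statistics

data = sorted([12, 27, 33, 35, 35, 42, 45, 47, 51, 53, 54])


def quartile_separators(values):
    """Return (Q1, Q2, Q3) for sorted values using the median-of-halves method."""
    n = len(values)
    half = n // 2
    lower = values[:half]               # values below the middle position
    upper = values[half + (n % 2):]     # values above it (the middle value is skipped when n is odd)
    return (statistics.median(lower),
            statistics.median(values),
            statistics.median(upper))


q1, q2, q3 = quartile_separators(data)
value_range = max(data) - min(data)     # spread of the whole data
iqr = q3 - q1                           # spread of the middle half
print(q1, q2, q3, value_range, iqr)
```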


Variance measures how far the values spread from the expected value (the mean). It can be obtained with the following formula, where $x_i$ are the individual data points, $\bar{x}$ is the mean and $n$ is the total number of data values:

\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n} \]

You can find the variance of any data by substituting the values into the formula, but make sure to remember how the variance is found.

Next, the deviation is the difference of each value from the average, or mean. We can use the following formula, where $i$ indexes the values in the data and $\mu$ represents the mean:

\[ \text{deviation} = x_i - \mu \]

For the data of a whole population we use the mean of the whole population, represented by $\mu$, and write the variance as $\sigma^2$:

\[ \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2 \]

For a sample drawn from a population we use the mean of the sample, represented by $\bar{x}$, instead of the population mean, and write the variance as $s^2$; this belongs to inferential statistics:

\[ s^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

Similarly, the standard deviation, i.e. the dispersion of the data from its mean, is found through the following formula, where $N$ is the number of data points:

\[ \sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2} \]


Let's consider an example to understand how to find the standard deviation. We will find the standard deviation of our sales data (33, 37, 14, 29).

First we need the mean $\mu$, i.e. 28.25, which we found earlier.

Then we find the difference of each sales value from the mean, square the differences and add them:

\[ \sum_{i=1}^{N} (x_i - \mu)^2 = (4.75)^2 + (8.75)^2 + (-14.25)^2 + (0.75)^2 = 302.75 \]

Finally we take the square root of the product of $1/N$ and 302.75 to find $\sigma$, with $N = 4$:

\[ \sigma = \sqrt{\tfrac{1}{4} \times 302.75} = \sqrt{75.6875} \approx 8.70 \]

Therefore, the standard deviation of the sales data is approximately 8.70. As mentioned earlier, don't sweat over the calculation; just understand how the formula is applied.
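As a check on the arithmetic, the population standard deviation of the sales data can be computed step by step. This is a minimal sketch using only the standard library; it divides by N, as in the formula above. Recomputing directly, the squared differences sum to 302.75, which gives a standard deviation of about 8.70.

```python
import math
import statistics

sales = [33, 37, 14, 29]

mu = statistics.mean(sales)                        # 28.25
squared_diffs = [(x - mu) ** 2 for x in sales]     # squared difference of each value from the mean
sum_of_squares = sum(squared_diffs)                # 302.75
variance = sum_of_squares / len(sales)             # divide by N (population variance)
sigma = math.sqrt(variance)                        # population standard deviation

print(sum_of_squares, variance, round(sigma, 2))
```

`statistics.pstdev(sales)` returns the same population value, while `statistics.stdev(sales)` divides by N - 1 and is meant for samples.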

Entropy and Information gain
Entropy, i.e. the uncertainty in a data set, can be found with the following formula:

\[ H(S) = -\sum_{i=1}^{N} p_i \log_2 p_i \]

where $S$ stands for the set of all instances in the data, $N$ is the number of distinct values and $p_i$ is the probability of the $i$-th value. Through entropy we can further calculate the information gain from a variable with the following formula:

\[ \text{Gain}(A, S) = H(S) - \sum_{j=1}^{v} \frac{|S_j|}{|S|} \, H(S_j) \]

where $A$ is the feature or variable whose information gain is being calculated, $H(S)$ is the entropy of the whole dataset, $|S_j|$ is the number of instances with value $j$ of feature $A$, $|S|$ is the number of all data instances in the dataset, $v$ is the number of distinct values of feature $A$, and $H(S_j)$ is the entropy of the subset of instances with value $j$ of feature $A$. The weighted sum, written $H(A, S)$, is the entropy of the dataset with respect to feature $A$.

Let's consider an example to understand entropy and information gain more clearly. We have the following dataset:

Day   Discount   Advertisement   Sales
1     10%        No              Average
2     25%        No              Maximum
3     20%        Yes             Maximum
4     10%        Yes             Maximum
5     25%        No              Average
6     10%        Yes             Maximum
7     20%        No              Maximum
8     20%        No              Maximum
9     10%        Yes             Average
10    20%        Yes             Maximum

You are asked to find the best feature, Discount or Advertisement, for getting Maximum sales. Which feature would you choose to build a model that predicts the best values for maximum sales? To figure that out, we find the information gain from each feature. Let's start with Discount. We have the following details:

Total values: 10 (Max 7, Avg. 3)
Discount 10%: Max 2, Avg. 2
Discount 20%: Max 4, Avg. 0
Discount 25%: Max 1, Avg. 1

Now we can find the entropy of the whole dataset using the entropy formula. There are 10 values in total, and the probabilities of Avg. sales and Max sales are 3/10 and 7/10 respectively:

\[ H(S) = -\sum_{i=1}^{N} p_i \log_2 p_i = -\frac{3}{10}\log_2\frac{3}{10} - \frac{7}{10}\log_2\frac{7}{10} \approx 0.881 \]

After obtaining the entropy we can substitute the values into the information gain formula to find the information gain from the Discount feature:

Discount   Sales
10%        Average
25%        Maximum
20%        Maximum
10%        Maximum
25%        Average
10%        Maximum
20%        Maximum
20%        Maximum
10%        Average
20%        Maximum

\[ \text{Gain}(\text{Discount}, S) = H(S) - \left[ \frac{4}{10} H(S_{10\%}) + \frac{4}{10} H(S_{20\%}) + \frac{2}{10} H(S_{25\%}) \right] = 0.881 - 0.4 - 0.0 - 0.2 = 0.881 - 0.6 = 0.281 \]

The information gain from the Discount feature is about 0.28. The feature with the highest information gain is the one used in models to get better predictions. If Advertisement is the feature instead, the counts are:

Advertisement Yes: Max 4, Avg. 1
Advertisement No: Max 3, Avg. 2

Similarly we can find the information gain of the Advertisement feature:

Advertisement   Sales
No              Average
No              Maximum
Yes             Maximum
Yes             Maximum
No              Average
Yes             Maximum
No              Maximum
No              Maximum
Yes             Average
Yes             Maximum

\[ \text{Gain}(A, S) = H(S) - H(A, S) = 0.881 - \frac{5}{10}\left(-\frac{4}{5}\log_2\frac{4}{5} - \frac{1}{5}\log_2\frac{1}{5}\right) - \frac{5}{10}\left(-\frac{3}{5}\log_2\frac{3}{5} - \frac{2}{5}\log_2\frac{2}{5}\right) \approx 0.881 - 0.361 - 0.485 = 0.035 \]

It is clear that we should use the Discount feature rather than the Advertisement feature, because it gives more information gain. And again, as mentioned earlier, just understand what is going on; this is one of the important techniques for data scientists.
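The entropy and information gain of the ten-day dataset can be recomputed with a short script. This is a sketch, not the author's code: the function names `entropy` and `information_gain` and the tuple layout of `rows` are assumptions made for the example. Evaluated exactly, the entropy of the Sales column is about 0.881 and the gains are about 0.281 for Discount and 0.035 for Advertisement, so Discount remains the better feature.

```python
import math
from collections import Counter

# (discount, advertisement, sales) for days 1 to 10 of the dataset above
rows = [
    ("10%", "No", "Average"), ("25%", "No", "Maximum"),
    ("20%", "Yes", "Maximum"), ("10%", "Yes", "Maximum"),
    ("25%", "No", "Average"), ("10%", "Yes", "Maximum"),
    ("20%", "No", "Maximum"), ("20%", "No", "Maximum"),
    ("10%", "Yes", "Average"), ("20%", "Yes", "Maximum"),
]


def entropy(labels):
    """H(S) = -sum(p_i * log2(p_i)) over the distinct label values."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())


def information_gain(rows, feature_index, label_index=2):
    """Gain(A, S) = H(S) - sum(|S_j| / |S| * H(S_j)) over the values j of feature A."""
    labels = [row[label_index] for row in rows]
    weighted = 0.0
    for value in set(row[feature_index] for row in rows):
        subset = [row[label_index] for row in rows if row[feature_index] == value]
        weighted += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - weighted


total_entropy = entropy([row[2] for row in rows])
gain_discount = information_gain(rows, 0)
gain_advertisement = information_gain(rows, 1)
print(round(total_entropy, 3), round(gain_discount, 3), round(gain_advertisement, 3))
```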

Confusion matrix
A confusion matrix is used to calculate the accuracy of a model. To calculate that, we need to create the confusion matrix first. Suppose we created a model that predicts the weather, and we asked it to predict whether it will rain on each of the next 30 days. After 30 days, we matched the predicted values with the actual values and found that the model predicted 25 days correctly and 5 days incorrectly:

30 days          Predicted no rain   Predicted rain
Actual no rain   8                   3
Actual rain      2                   17

Here 8 is referred to as the true negatives (T-), 3 as the false positives (F+), 2 as the false negatives (F-) and 17 as the true positives (T+). We can use these values to calculate the accuracy of the model using the following formula:

\[ \text{accuracy} = \frac{(T+) + (T-)}{(T+) + (T-) + (F+) + (F-)} \]

So the accuracy of our model is

\[ \text{accuracy} = \frac{17 + 8}{17 + 8 + 3 + 2} = \frac{25}{30} \approx 0.83 \]

Similarly we can calculate the error rate, or misclassification rate, using the following formula:

\[ \text{error rate} = \frac{(F+) + (F-)}{(T+) + (T-) + (F+) + (F-)} = \frac{3 + 2}{17 + 8 + 3 + 2} = \frac{5}{30} \approx 0.17 \]

It is the same as 1 - accuracy. Next, we can calculate the precision of the model using the formula below:

\[ \text{precision} = \frac{(T+)}{(T+) + (F+)} = \frac{17}{17 + 3} = \frac{17}{20} = 0.85 \]

Precision can also be described as the fraction of the model's positive predictions that are actually positive.


We can also calculate the recall using the formula below:

\[ \text{recall} = \frac{(T+)}{(T+) + (F-)} = \frac{17}{17 + 2} = \frac{17}{19} \approx 0.89 \]

Finally, the F-measure combines precision and recall into a single score, which is useful when a model has high precision but low recall or vice versa:

\[ \text{F-measure} = \frac{2 \times \text{recall} \times \text{precision}}{\text{recall} + \text{precision}} = \frac{2 \times 0.89 \times 0.85}{0.89 + 0.85} = \frac{1.51}{1.74} \approx 0.87 \]
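All of the confusion-matrix metrics follow from the four counts, so they are easy to recompute. The sketch below uses plain Python with the counts from the rain example; the variable names are illustrative. Without intermediate rounding, accuracy is 25/30 (about 0.83), the error rate is 5/30 (about 0.17) and the F-measure is 34/39 (about 0.87).

```python
# Counts from the 30-day rain example
tp, tn, fp, fn = 17, 8, 3, 2          # true positives, true negatives, false positives, false negatives
total = tp + tn + fp + fn

accuracy = (tp + tn) / total          # share of days predicted correctly
error_rate = (fp + fn) / total        # share of days predicted incorrectly
precision = tp / (tp + fp)            # of the predicted rain days, how many really had rain
recall = tp / (tp + fn)               # of the actual rain days, how many were predicted
f_measure = 2 * precision * recall / (precision + recall)

for name, value in [("accuracy", accuracy), ("error rate", error_rate),
                    ("precision", precision), ("recall", recall),
                    ("F-measure", f_measure)]:
    print(f"{name}: {value:.2f}")
```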

Probability
Probability is one of the simplest mathematical calculations for predicting the outcome of an event, using the following formula:

\[ P(A) = \frac{\text{favourable outcomes}}{\text{total outcomes}} \]

Three further concepts are required for machine learning, all of which belong to both statistics and probability: the probability density function, the normal distribution and the central limit theorem.

Probability Density Function
A probability density function has the following properties:
• It is continuous over the range.
• The area between the curve and the x-axis is equal to 1.
• The probability that the variable lies between a and b is the area under the curve between a and b.
Any variable that satisfies these conditions is called a continuous random variable.

Normal Distribution
A normal distribution is an arrangement of a data set in which most values cluster in the middle of the range and the rest taper off symmetrically toward either extreme. In a normal distribution the mean, median and mode are the same. Variables (features) that follow this distribution are called normal random variables; when the mean is 0 and the variance is 1, the distribution is called the standard normal distribution. We can represent the distribution as a bell-shaped graph:

[Figure: bell curve of the normal distribution]

where $\mu$ is the mean and $\sigma$ is the standard deviation. The formula of the normal distribution can be written as:

\[ y = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-(x - \mu)^2 / (2\sigma^2)}, \quad \text{where } e \approx 2.718 \]
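The normal distribution formula can be evaluated directly, with the usual negative sign in the exponent. The function name `normal_pdf` and its default parameters (mean 0, standard deviation 1) are assumptions for this sketch.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and standard deviation sigma at x."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)


peak = normal_pdf(0.0)       # highest point of the curve, at the mean
left = normal_pdf(-1.0)      # one standard deviation below the mean
right = normal_pdf(1.0)      # one standard deviation above the mean
print(peak, left, right)
```

The curve is symmetric around the mean, so `left` and `right` are equal, and both are lower than `peak`.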

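Central Limit Theorem
The theorem states that if we repeatedly draw sufficiently large samples from a population, the means of those samples are approximately normally distributed around the population mean $\mu$. In practice, the mean of a sample should therefore be close to the population mean.

[Figure: a population mean with the means of Sample 1, Sample 2 and Sample 3 drawn from it]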

Types of Probability
Probability can be classified into the following three types:
• Marginal probability: the probability of a single event without conditions, like drawing a particular number from the first ten natural numbers.
• Joint probability: the probability of two events occurring at once, like drawing a card that is both red and a 4 from a deck of cards.
• Conditional probability: the probability of an event given that some condition holds. The condition may already be fulfilled, or it may be decided at the moment of the event. For example, drawing a joker card from your friend's hand, where the joker may or may not be present.
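The three types of probability can be illustrated by counting cards in a standard 52-card deck (no jokers). This is a sketch under stated assumptions: the helper `probability` and the two event functions are hypothetical names, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction
from itertools import product

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["hearts", "diamonds", "clubs", "spades"]
deck = list(product(ranks, suits))            # 52 (rank, suit) pairs


def probability(event, outcomes):
    """Favourable outcomes divided by total outcomes."""
    favourable = [outcome for outcome in outcomes if event(outcome)]
    return Fraction(len(favourable), len(outcomes))


def is_red(card):
    return card[1] in ("hearts", "diamonds")


def is_four(card):
    return card[0] == "4"


# Marginal: drawing a 4, with no condition attached.
p_four = probability(is_four, deck)
# Joint: drawing a card that is both red and a 4.
p_red_and_four = probability(lambda card: is_red(card) and is_four(card), deck)
# Conditional: drawing a 4 given that the card is already known to be red.
red_cards = [card for card in deck if is_red(card)]
p_four_given_red = probability(is_four, red_cards)

print(p_four, p_red_and_four, p_four_given_red)
```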


Bayes Theorem
Bayes theorem is a way of finding the probability of an event when we know the probabilities of other related events or conditions. The formula is given as:

\[ P(A|B) = \frac{P(B|A) \, P(A)}{P(B)} \]

Let's say P(Fire) means how often there is fire, and P(Smoke) means how often we see smoke. Then:
• P(Fire|Smoke) means how often there is fire when we can see smoke
• P(Smoke|Fire) means how often we can see smoke when there is fire

Suppose we know the following probabilities:
• dangerous fire, P(Fire) = 1%
• smoke, P(Smoke) = 10%
• smoke when there is a dangerous fire, P(Smoke|Fire) = 90%
Then the probability of a dangerous fire when there is smoke is:

P(Fire|Smoke) = P(Smoke|Fire) x P(Fire) / P(Smoke) = 90% x 1% / 10% = 9%

Using Bayes theorem we can find the probabilities of many events. There is even a Naive Bayes model in machine learning, which we will learn in the upcoming lessons.
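The fire and smoke calculation is a direct application of the formula, as this small sketch shows. The function name `bayes` and its parameter names are assumptions for illustration.

```python
def bayes(p_b_given_a, p_a, p_b):
    """P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b


p_fire = 0.01               # P(Fire): how often there is a dangerous fire
p_smoke = 0.10              # P(Smoke): how often we see smoke
p_smoke_given_fire = 0.90   # P(Smoke|Fire): how often a dangerous fire produces smoke

p_fire_given_smoke = bayes(p_smoke_given_fire, p_fire, p_smoke)
print(f"P(Fire|Smoke) = {p_fire_given_smoke:.0%}")
```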

10 SCIKIT LEARN ALGORITHMS
• Regression algorithm • Classification algorithm • Clustering algorithm

Algorithms
We already learnt about the different types of machine learning, like supervised learning, etc. But how will you choose the best algorithm for your problem? To do that, we need to understand the algorithms we are going to use, so that we can decide which one is suitable for our problem.

Regression Algorithm
As a rule of thumb, any scikit-learn algorithm needs more than 50 data points or values. The following visual will help you understand the working of scikit-learn's regression models, which are used to predict a quantity:

['name','sales']]

   name             sales
0  biscuits         227
1  cookies          158
2  cake             50
3  whey_supplement  24
4  protein_bars     85
5  potato_chips     121

or just some of the rows:

In [46]:
import pandas as pan
dt = pan.read_csv("csv_data.csv")
dt.loc[4:6, ['name', 'sales']]

Out[46]:
   name          sales
4  protein_bars  85
5  potato_chips  121

To access a single element, we can use its row and column index with the values attribute.

In [49]:
import pandas as pan
dt = pan.read_csv("csv_data.csv")
dt

   id   name             price  sales  brand
0  101  biscuits         5.00   227    HomeFoods
1  102  cookies          7.25   158    TBakery
2  103  cake             12.00  50     TBakery
3  104  whey_supplement  34.90  24     MusleUp
4  105  protein_bars     4.90   85     MusleUp
5  106  potato_chips     1.75   121    HomeFoods

In [50]:
biscuits_sales = dt.values[0,3]
biscuits_sales

Out[50]:
227

[Figure: the dt.values array with row indexes 0 to 5 and column indexes 0 to 4 (id, name, price, sales, brand); dt.values[0,3] selects the value 227]

The data values are stored as an ndarray, so we can access single elements by indexing, similar to slicing a DataFrame.

IMPORTING DATA

Importing JSON Data
A JSON file stores data as text in a human-readable format. JSON stands for JavaScript Object Notation. Get your sample JSON data here:

json_data (check the Resources)

Pandas can read JSON files using the read_json function.

In [2]:
import pandas as pan
dt = pan.read_json("json_data.json")
dt

   ID   Name             Price  Sales  Brand
0  101  Biscuits         5.00   227    HomeFoods
1  102  Cookies          7.25   158    TBakery
2  103  Cake             12.00  52     TBakery
3  104  Whey Supplement  34.90  24     MusleUp
4  105  Protein Bars     4.90   85     MusleUp
5  106  Potato Chips     1.75   121    HomeFoods

As with CSV files, we can perform all the slicing and data extraction operations on JSON data files.

In [6]:
import pandas as pan
dt = pan.read_json("json_data.json")
print(dt.loc[:, ["ID", "Name", "Sales"]])
print(dt["Name"])
print(dt.values[5,4])

    ID             Name  Sales
0  101         Biscuits    227
1  102          Cookies    158
2  103             Cake     52
3  104  Whey Supplement     24
4  105     Protein Bars     85
5  106     Potato Chips    121
0           Biscuits
1            Cookies
2               Cake
3    Whey Supplement
4       Protein Bars
5       Potato Chips
Name: Name, dtype: object
HomeFoods

Importing EXCEL Data
Microsoft Excel is a very widely used spreadsheet program. Its user friendliness and appealing features make it a very frequently used tool in data science. Get your sample Excel data here:

xlsx_data (check the Resources)

The read_excel function of the pandas library is used to read the content of an Excel file into the Python environment as a pandas DataFrame.

In [9]:
import pandas as pan
dt = pan.read_excel("xlsx_data.xlsx")
dt

   id   name             price  sales  brand      Unnamed: 5
0  101  biscuits         5.00   227    HomeFoods  NaN
1  102  cookies          7.25   158    TBakery    NaN
2  103  cake             12.00  50     34         TBakery
3  104  whey_supplement  34.90  24     MusleUp    NaN
4  105  protein_bars     4.90   85     MusleUp    NaN
5  106  potato_chips     1.75   121    HomeFoods  NaN

As Excel sheets are imported as pandas DataFrames, we can perform all the usual DataFrame tasks on the Excel data. You may notice that we have an Unnamed: 5 column filled with NaN values [except dt.values[2,5]]. Let's clean up our data. First we need to remove the Unnamed: 5 column, which we can do using the del keyword.

In [10]:
import pandas as pan
dt = pan.read_excel("xlsx_data.xlsx")
del dt["Unnamed: 5"]

As we learned earlier, the del keyword removes the whole column, so we don't need any further data cleansing for it.

We have removed the Unnamed: 5 column.

In [11]:
import pandas as pan
dt = pan.read_excel("xlsx_data.xlsx")
del dt["Unnamed: 5"]
dt

   id   name             price  sales  brand
0  101  biscuits         5.00   227    HomeFoods
1  102  cookies          7.25   158    TBakery
2  103  cake             12.00  50     34
3  104  whey_supplement  34.90  24     MusleUp
4  105  protein_bars     4.90   85     MusleUp
5  106  potato_chips     1.75   121    HomeFoods

Now we need to replace the 34 in the brand column, i.e. dt.values[2,4], with TBakery. We can use the replace method.

In [12]:
import pandas as pan
dt = pan.read_excel("xlsx_data.xlsx")
del dt["Unnamed: 5"]
dt.replace({34: "TBakery"})

   id   name             price  sales  brand
0  101  biscuits         5.00   227    HomeFoods
1  102  cookies          7.25   158    TBakery
2  103  cake             12.00  50     TBakery
3  104  whey_supplement  34.90  24     MusleUp
4  105  protein_bars     4.90   85     MusleUp
5  106  potato_chips     1.75   121    HomeFoods

So our data is clean, with no errors. Try recapping the chapter and attempt the Exercise, where you'll be provided with sample data files [links] containing lots of errors, and you have to perform all the data cleansing practised in the previous lesson. This will be a very good exercise to help you understand more about data processing and cleansing.
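One detail worth keeping in mind is that `replace` returns a new DataFrame and leaves the original untouched unless the result is assigned back. The sketch below shows this with a small in-memory table; the rows are a hypothetical stand-in for the spreadsheet, and the nested dictionary limits the replacement to the brand column.

```python
import pandas as pd

# Hypothetical stand-in for three rows of the spreadsheet discussed above.
dt = pd.DataFrame({
    "name": ["cookies", "cake", "whey_supplement"],
    "price": [7.25, 12.00, 34.90],
    "brand": ["TBakery", 34, "MusleUp"],
})

# replace() returns a new DataFrame; here only the "brand" column is searched.
cleaned = dt.replace({"brand": {34: "TBakery"}})

brand_before = dt.loc[1, "brand"]        # the original frame still holds 34
brand_after = cleaned.loc[1, "brand"]    # the returned frame holds the fix

# Assign the result back to keep the cleaned data.
dt = cleaned
print(dt)
```

Restricting the replacement to one column avoids accidentally changing a genuine value of 34 in another column, such as sales.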

12 DATA OPERATIONS
• NumPy operations • Pandas operations • Cleaning data

Python handles data of various formats mainly through two libraries, pandas and NumPy. We have already seen the important features of these two libraries in the previous chapters. In this chapter we will see some basic examples from each library of how to operate on data and perform different tasks like cleaning the data, analytics, etc.

NumPy Operations
To start working with NumPy, we need to import numpy in order to create NumPy arrays.

In [ ]:
import numpy

Now let's create an array using the array() function and print it.

In [2]:
import numpy
ar = numpy.array([1, 5, 7])
print(ar)

[1 5 7]

ar is a 1-dimensional array. We can also create a 2-dimensional array by placing one or more 1-dimensional arrays inside another array.

In [8]:
import numpy
ar = numpy.array([[1, 5, 7], [2, 3, 9]])
print(ar)

[[1 5 7]
 [2 3 9]]

We can specify the minimum number of dimensions of an array during creation using the ndmin parameter.

In [10]:
import numpy
ar = numpy.array([1, 5, 7], ndmin=2)
print(ar)

[[1 5 7]]

Although we passed a 1-dimensional list, it became a 2-dimensional array because we specified the number of dimensions in the ndmin parameter. We created an array of integers, so let's now create arrays of strings and floats from the same values using the dtype parameter.

In [11]:
import numpy
ar_str = numpy.array([1, 5, 7], dtype=str)
ar_flt = numpy.array([1, 5, 7], dtype=float)
print(ar_str)
print(ar_flt)

['1' '5' '7']
[1. 5. 7.]

ar_str is an array of strings and ar_flt is an array of floats. We can also change these numbers to complex numbers in the same way, using complex as the dtype.

In [13]:
import numpy
ar_str = numpy.array([1, 5, 7], dtype=str)
ar_flt = numpy.array([1, 5, 7], dtype=float)
ar_cmx = numpy.array([1, 5, 7], dtype=complex)
print(ar_str)
print(ar_flt)
print(ar_cmx)

['1' '5' '7']
[1. 5. 7.]
[1.+0.j 5.+0.j 7.+0.j]

ar_str, ar_flt and ar_cmx are arrays created from the same data but with different data types: strings, floats and complex numbers respectively.

Pandas Operations Pandas handles data through Series,Data Frame, and Panel. We will learn to create each of these.

Pandas Series We already know what Pandas Series is. A pandas Series can be created using the SeriesO function so, let's import pandas and create series. import pandas sr = pandas.Series([1,5,7]) print(sr)

In [14]:

0 1 1 5 2 7 dtype: int64

As you can see, our data is indexed from 0 to 2 and the data type is printed as integer. We can specify our own index labels with the index parameter. In [16]:

import pandas
sr = pandas.Series([1,5,7], index = ['A','B','C'])
print(sr)

A    1
B    5
C    7
dtype: int64

Like ndarrays, we can also specify the data type of a pandas Series using the dtype parameter during creation. In [18]:

import pandas
sr = pandas.Series([1,5,7], dtype = complex)
print(sr)

0    1.000000+0.000000j
1    5.000000+0.000000j
2    7.000000+0.000000j
dtype: complex128

We can use an ndarray to create a pandas Series. In [19]:

import numpy
import pandas
ar = numpy.array([1,5,7])
sr = pandas.Series(data = ar, copy = True)
# is the same as sr = pandas.Series(ar, copy = True)
print(sr)

0    1
1    5
2    7
dtype: int32

We passed the ar ndarray as the data for the series (using the data keyword isn't necessary, it's just for clarity) and also used the copy parameter to create a copy of the data. If you want the data without the index, use the values attribute. In [21]:

import numpy
import pandas
ar = numpy.array([1,5,7])
sr = pandas.Series(ar)
print(sr.values)

[1 5 7]

You can print a more detailed version of the above using the array attribute. In [22]:

import numpy
import pandas
ar = numpy.array([1,5,7])
sr = pandas.Series(ar)
print(sr.array)

[1, 5, 7]
Length: 3, dtype: int32

[Figure: the printed array annotated with its array type, values, length and data type]

You can use the values or array attribute according to your needs, depending on whether you want just the values or a summarized description of the array inside the Series. Also note the difference between the array() function in NumPy and the array attribute in pandas.

Pandas Data Frames A pandas DataFrame aligns data in a tabular fashion of rows and columns. A DataFrame can be created using the DataFrame() function; here we pass a dictionary as the data. In [23]:

import pandas
df = pandas.DataFrame({"Product":['Cookies','Biscuits'],
                       "Sales":[157,227]})
print(df)

    Product  Sales
0   Cookies    157
1  Biscuits    227

Dictionary keys become the columns and their values become the contents of the rows of the DataFrame. We can also use the index parameter here. In [24]:

import pandas
df = pandas.DataFrame({"Product":['Cookies','Biscuits'],
                       "Sales":[157,227]}, index = [1,2])
print(df)

    Product  Sales
1   Cookies    157
2  Biscuits    227

We can define the columns and their data separately using ndarrays. In [42]:

import pandas
import numpy
ar = numpy.array([[1,3],[6,2]])
df = pandas.DataFrame(data = ar, index = ['A','B'],
                      columns = ['C1','C2'])
print(df)

   C1  C2
A   1   3
B   6   2

The data is stored in the ndarray and the column names are defined in the DataFrame's columns parameter. Note that a 2-dimensional ndarray containing two 1-dimensional arrays is passed to the data parameter; each inner array becomes one row.

We can add columns to the DataFrame using the [] = syntax. In [44]:

import pandas
import numpy
ar = numpy.array([[1,3],[6,2]])
df = pandas.DataFrame(data = ar, index = ['A','B'],
                      columns = ['C1','C2'])
df['C3'] = (df['C1']*5)
print(df)

   C1  C2  C3
A   1   3   5
B   6   2  30

We can delete columns from the DataFrame using the del statement. In [45]:

import pandas
import numpy
ar = numpy.array([[1,3],[6,2]])
df = pandas.DataFrame(data = ar, index = ['A','B'],
                      columns = ['C1','C2'])
df['C3'] = (df['C1']*5)
del df['C2']
print(df)

   C1  C3
A   1   5
B   6  30

We can print a column of the DataFrame using the [] syntax. In [46]:

import pandas
import numpy
ar = numpy.array([[1,3],[6,2]])
df = pandas.DataFrame(data = ar, index = ['A','B'],
                      columns = ['C1','C2'])
print(df['C1'])

A    1
B    6
Name: C1, dtype: int32

Slicing Syntax To get a single element from an ndarray, a pandas Series or a pandas DataFrame, we index it with square brackets; the more general slice syntax is [start:end:step(optional)]. Let's extract some elements from the kinds of arrays we have created so far. In [59]:

import numpy as npy
ar1 = npy.array([1, 5])
ar2 = npy.array([[1, 3], [5, 2]])
ar3 = npy.array([[[1, 3], [5, 2]],
                 [[2, 4], [4, 6]]])
# Slicing 1-Dimensional array
print(ar1[0])
# Slicing 2-Dimensional array
print(ar2[0,1])
# Slicing 3-Dimensional array
print(ar3[1,0,1])

1
3
4

We use a comma (,) to index further into arrays of 2 or more dimensions. The following figure will help you understand the slicing of the 3-dimensional array better.

[Figure: slicing ar3 step by step. Full array ar3: [[[1, 3], [5, 2]], [[2, 4], [4, 6]]]. First slice ar3[1]: [[2, 4], [4, 6]]. Second slice ar3[1,0]: [2, 4]. Final slice ar3[1,0,1]: 4]

Slicing may seem a bit tough for beginners because of the dimensions, which is why I created the figure to help you understand it better. If you are confident, try solving the slicing questions in the Exercise.
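The steps in the figure can also be reproduced in code by peeling off one dimension at a time. This is a small sketch; the names first, second and final are chosen here for illustration.

```python
import numpy as npy

ar3 = npy.array([[[1, 3], [5, 2]],
                 [[2, 4], [4, 6]]])

first = ar3[1]        # second 2-D block: [[2, 4], [4, 6]]
second = first[0]     # first row of that block: [2, 4]
final = second[1]     # second element of that row

# chained indexing and comma indexing reach the same element
print(final, ar3[1, 0, 1])
```

Each index removes one dimension, so three indexes on a 3-dimensional array leave a single value.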


To get an element from a pandas Series, we use the [] syntax. In [4]:

import pandas as pan
sr = pan.Series([1, 3, 5], index = ['a','b','c'])
print(sr['a'])  # explicit indexing (by label)
print(sr[0])    # implicit indexing (by position); recent pandas versions prefer sr.iloc[0]

1
1

If your index labels are themselves numbers, like these. In [6]:

import pandas as pan
sr = pan.Series([1, 3, 5], index = [2,4,6])

then to get the second element by its explicit index (the labels defined in the index parameter) use the .loc[] syntax, and to get it by its implicit index (the positions 0, 1, 2, ...) use the .iloc[] syntax. In [7]:

import pandas as pan
sr = pan.Series([1, 3, 5], index = [2,4,6])
print(sr.loc[4])
print(sr.iloc[1])

3
3

We can modify or delete elements using the same indexing. In [9]:

import pandas as pan
sr = pan.Series([1, 3, 5], index = [2,4,6])
sr[4] = 7
print(sr)
del sr[6]
print(sr)

2    1
4    7
6    5
dtype: int64
2    1
4    7
dtype: int64
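The difference between label-based and position-based access is easiest to see side by side. The sketch below reuses the Series with numeric labels; the variable names by_label, by_position and plain are illustrative.

```python
import pandas as pan

sr = pan.Series([1, 3, 5], index=[2, 4, 6])

by_label = sr.loc[4]       # label 4 is the second element
by_position = sr.iloc[1]   # position 1 is also the second element

# with integer labels, plain [] is treated as a label lookup
plain = sr[4]
print(by_label, by_position, plain)
```

Using .loc and .iloc explicitly avoids having to remember which rule plain [] follows for a given index.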


Let's say we have a DataFrame like this. In [27]:

import pandas as pan
sr = pan.DataFrame({'Product':['Biscuit','Cookies'],
                    'Sales':[227,158]}, index = [1,2])
sr

Out[27]:
   Product  Sales
1  Biscuit    227
2  Cookies    158

If we want the Sales column only, we use the [] syntax. In [32]:

import pandas as pan
sr = pan.DataFrame({'Product':['Biscuit','Cookies'],
                    'Sales':[227,158]}, index = [1,2])
sr['Sales']

Out[32]:
1    227
2    158
Name: Sales, dtype: int64

To get the second row only, we use the .loc[] syntax. In [33]:

import pandas as pan
sr = pan.DataFrame({'Product':['Biscuit','Cookies'],
                    'Sales':[227,158]}, index = [1,2])
sr.loc[2]  # You can also use sr.iloc[1]

Out[33]:
Product    Cookies
Sales          158
Name: 2, dtype: object

To get the sales of cookies only, we use the .values[] syntax. In [37]:

import pandas as pan
sr = pan.DataFrame({'Product':['Biscuit','Cookies'],
                    'Sales':[227,158]}, index = [1,2])
sr.values[1,1]

Out[37]: 158

The values are returned as an ndarray, which is why indexing them works like indexing a 2-dimensional ndarray.

We can delete a whole column from the DataFrame. In [41]:

import pandas as pan
sr = pan.DataFrame({'Product':['Biscuit','Cookies'],
                    'Sales':[227,158]}, index = [1,2])
sr

Out[41]:
   Product  Sales
1  Biscuit    227
2  Cookies    158

In [45]:

del sr['Sales']
sr

Out[45]:
   Product
1  Biscuit
2  Cookies

but we cannot delete a single value. In [47]:

import pandas as pan
sr = pan.DataFrame({'Product':['Biscuit','Cookies'],
                    'Sales':[227,158]}, index = [1,2])
del sr.values[1,1]

ValueError                                Traceback (most recent call last)
      2 sr = pan.DataFrame({'Product':['Biscuit','Cookies'],
      3                     'Sales':[227,158]}, index = [1,2])
----> 4 del sr.values[1,1]

ValueError: cannot delete array elements

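nor can you modify a value through .values. In [48]:

import pandas as pan
sr = pan.DataFrame({'Product':['Biscuit','Cookies'],
                    'Sales':[227,158]}, index = [1,2])
sr

Out[48]:
   Product  Sales
1  Biscuit    227
2  Cookies    158

In [51]:

sr.values[1,1] = 162
sr.values[1,1]

Out[51]:
158

The assignment is silently lost because, for a DataFrame with mixed column types, .values builds a new array each time it is accessed, so the change never reaches the DataFrame itself.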
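A single cell can still be changed by assigning through the DataFrame's own indexers rather than through .values. The sketch below uses .loc and .iloc; the replacement numbers 162 and 230 are arbitrary example values.

```python
import pandas as pan

df = pan.DataFrame({'Product': ['Biscuit', 'Cookies'],
                    'Sales': [227, 158]}, index=[1, 2])

# assign through the DataFrame itself, not through df.values
df.loc[2, 'Sales'] = 162      # by row label and column name
print(df.loc[2, 'Sales'])

df.iloc[0, 1] = 230           # by row position and column position
print(df)
```

Both assignments modify the DataFrame in place, which is what the .values attempt above was trying to achieve.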


More with ndarrays We can reverse an ndarray using the [::-1] syntax. In [56]:

import numpy as npy
ar = npy.array([1,2,3,4])
ar

Out[56]:
array([1, 2, 3, 4])

In [55]:

ar = ar[::-1]
ar

Out[55]:
array([4, 3, 2, 1])

We can sort a whole ndarray in place with the sort() method, without doing it the long way. In [63]:

import numpy as npy
ar = npy.array([5,1,3,9])
ar

Out[63]:
array([5, 1, 3, 9])

In [64]:

ar.sort()
ar

Out[64]:
array([1, 3, 5, 9])

There are many built-in ndarray methods that are not discussed here but will be used in later lessons. You can consult the NumPy documentation for the full list of functions and their roles; since we don't need every function for data processing and analysis, the miscellaneous ones are not covered in this book.

Data Cleansing Let's consider a situation like the one below. In [71]:

import pandas as pan
import numpy as npy
ar = npy.array([[1,2,3],[4,7,2],[4,9,1]])
df = pan.DataFrame(data = ar, index = ['a','c','e'],
                   columns = ['C1','C2','C3'])
df

Out[71]:
   C1  C2  C3
a   1   2   3
c   4   7   2
e   4   9   1

In [72]:

df = df.reindex(['a','b','c','d','e'])
df

Out[72]:
    C1   C2   C3
a  1.0  2.0  3.0
b  NaN  NaN  NaN
c  4.0  7.0  2.0
d  NaN  NaN  NaN
e  4.0  9.0  1.0

The reindexed DataFrame has NaN values in the b and d rows. This happened because there is no data for the b and d rows. By reindexing, we have created a DataFrame with missing values. In the output, NaN means Not a Number. To make detecting missing values easier (and consistent across different array dtypes), pandas provides the isnull() and notnull() functions, which are also methods on Series and DataFrame objects.

Name: C1, dtype: bool
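A minimal sketch of these checks on the reindexed DataFrame follows. The choice of the C1 column and the names missing and present are assumptions made for illustration.

```python
import pandas as pan
import numpy as npy

ar = npy.array([[1, 2, 3], [4, 7, 2], [4, 9, 1]])
df = pan.DataFrame(ar, index=['a', 'c', 'e'], columns=['C1', 'C2', 'C3'])
df = df.reindex(['a', 'b', 'c', 'd', 'e'])

missing = df['C1'].isnull()    # True where the value is NaN
present = df['C1'].notnull()   # the opposite mask
print(missing)
print(df.isnull().sum())       # number of missing values per column
```

Summing the boolean mask per column is a compact way to see how much data is missing before deciding how to clean it.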


Pandas provides various methods for cleaning missing values. The fillna function can fill in NaN values with non-null data in a couple of ways, such as replacing NaN values with 0. In [74]:

df.fillna(0)

Out[74]:
    C1   C2   C3
a  1.0  2.0  3.0
b  0.0  0.0  0.0
c  4.0  7.0  2.0
d  0.0  0.0  0.0
e  4.0  9.0  1.0

We can copy the value above or below the missing data by passing 'pad' or 'bfill' to the method parameter of the fillna function. In [75]:

df.fillna( method = 'pad' )

    C1   C2   C3
a  1.0  2.0  3.0
b  1.0  2.0  3.0
c  4.0  7.0  2.0
d  4.0  7.0  2.0
e  4.0  9.0  1.0
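The same forward and backward filling is also available through the dedicated ffill() and bfill() methods. To my knowledge, recent pandas versions deprecate the method argument of fillna in favour of these, so the sketch below may be the safer spelling on a current installation.

```python
import pandas as pan
import numpy as npy

ar = npy.array([[1, 2, 3], [4, 7, 2], [4, 9, 1]])
df = pan.DataFrame(ar, index=['a', 'c', 'e'], columns=['C1', 'C2', 'C3'])
df = df.reindex(['a', 'b', 'c', 'd', 'e'])

forward = df.ffill()    # copy the last valid value downwards (like 'pad')
backward = df.bfill()   # copy the next valid value upwards
print(forward)
print(backward)
```

Forward filling gives row b the values of row a, while backward filling gives it the values of row c.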

We can drop the rows with missing values using the dropna function. In [76]:

df.dropna()

Out[76]:
    C1   C2   C3
a  1.0  2.0  3.0
c  4.0  7.0  2.0
e  4.0  9.0  1.0

If we want to change a particular value wherever it occurs in a DataFrame, we can use the replace function. In [78]:

import pandas as pan
import numpy as npy
ar = npy.array([[1,2,3],[4,7,2],[4,9,1]])
df = pan.DataFrame(data = ar, index = ['a','c','e'],
                   columns = ['C1','C2','C3'])
df.replace({3:13})

Out[78]:
   C1  C2  C3
a   1   2  13
c   4   7   2
e   4   9   1

DATA ANALYSIS & PROCESSING
• Data Analytics
• Correlations between attributes
• Skewness of the data

As we learned in the mathematics for machine learning lesson, we need a lot of analytics or statistics about our data to understand it. As we already know, the measures of central tendency, i.e. mean, median and mode, are the basic statistics of our data: they tell us the average, the middle (50%) value and the most frequently occurring value. We will analyze our data in the same way and, as mentioned earlier, we don't need to calculate these manually or through formulas; there are plenty of functions in different libraries to conduct the analysis.

Data analytics

Before training any model we need to check the data and its details. We will use 'trees.csv' as our data for now. You can get the file either by scanning the QR code or from the link https://defmycode.cf/wp-content/u. Make sure to move the file to the home directory of Jupyter Notebook and then import the CSV data. Before doing anything further, let's have a look at our raw data. In [2]:

import pandas
dt = pandas.read_csv('trees.csv')
dt

Out[2]:
   Index  "Girth (in)"  "Height (ft)"  "Volume(ft^3)"
0      1           8.3             70            10.3
1      2           8.6             65            10.3
2      3           8.8             63            10.2
3      4          10.5             72            16.4
4      5          10.7             81            18.8

The first analysis is to find the shape of the data, i.e. how many rows and columns it contains. We can do so using the shape attribute of the DataFrame object. In [4]:

import pandas
dt = pandas.read_csv('trees.csv')
dt.shape

Out[4]:
(31, 4)

So our data has 31 rows and 4 columns, i.e. 124 values in total. If we want, we can inspect just the first 10 rows using the head() function and passing 10 as the argument. In [5]:

import pandas
dt = pandas.read_csv('trees.csv')
dt.head(10)

   Index  "Girth (in)"  "Height (ft)"  "Volume(ft^3)"
0      1           8.3             70            10.3
1      2           8.6             65            10.3
2      3           8.8             63            10.2
3      4          10.5             72            16.4
4      5          10.7             81            18.8
5      6          10.8             83            19.7
6      7          11.0             66            15.6
7      8          11.0             75            18.2
8      9          11.1             80            22.6
9     10          11.2             75            19.9

To get a statistical overview of the whole data we can use the describe() function, which provides 8 statistics: count, mean, standard deviation, minimum value, 25% (first quartile), 50% (median), 75% (third quartile) and maximum value.

In [6]:

import pandas
dt = pandas.read_csv('trees.csv')
dt.describe()

           Index  "Girth (in)"  "Height (ft)"  "Volume(ft^3)"
count  31.000000     31.000000      31.000000       31.000000
mean   16.000000     13.248387      76.000000       30.170968
std     9.092121      3.138139       6.371813       16.437846
min     1.000000      8.300000      63.000000       10.200000
25%     8.500000     11.050000      72.000000       19.400000
50%    16.000000     12.900000      76.000000       24.200000
75%    23.500000     15.250000      80.000000       37.300000
max    31.000000     20.600000      87.000000       77.000000

If you want the values rounded off to, say, 2 decimal places, we can use the pandas set_option() function and set the display precision to 2 (the option's full name is display.precision; newer pandas versions reject the short name 'precision'). We can set many other options through this function.

In [7]:

import pandas
dt = pandas.read_csv('trees.csv')
pandas.set_option('display.precision', 2)
dt.describe()

Out[7]:
       Index  "Girth (in)"  "Height (ft)"  "Volume(ft^3)"
count  31.00         31.00          31.00           31.00
mean   16.00         13.25          76.00           30.17
std     9.09          3.14           6.37           16.44
min     1.00          8.30          63.00           10.20
25%     8.50         11.05          72.00           19.40
50%    16.00         12.90          76.00           24.20
75%    23.50         15.25          80.00           37.30
max    31.00         20.60          87.00           77.00

Correlation between attributes The relation between two attributes (feature or label) in a dataset is called correlation. It is important to know the relations between the attributes. We can find them using the corr() function, which can calculate Pearson's correlation coefficient. Pearson's correlation coefficient can be interpreted as follows:
• 1 represents perfect positive correlation
• 0 represents no linear relation at all
• -1 represents perfect negative correlation
In [2]:

import pandas
dt = pandas.read_csv('trees.csv')
pandas.set_option('display.precision', 2)
dt.corr(method='pearson')

                Index  "Girth (in)"  "Height (ft)"  "Volume(ft^3)"
Index            1.00          0.97           0.47            0.90
"Girth (in)"     0.97          1.00           0.52            0.97
"Height (ft)"    0.47          0.52           1.00            0.60
"Volume(ft^3)"   0.90          0.97           0.60            1.00

Note that we set the precision to 2 to keep the values rounded off to 2 decimal places. In the corr() function we specified pearson in the method parameter. As we already know, the Girth, Height and Volume of a tree are correlated, which is why we get values of around 0.5 to 1.0. These represent positive correlation, i.e. trees with a larger Girth or Height tend to have a larger Volume, and vice versa.
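Pearson's coefficient is the covariance of two attributes divided by the product of their standard deviations. The sketch below checks this by hand against pandas on the first five Girth and Volume values shown earlier; the short column names and the variables r_manual and r_pandas are illustrative.

```python
import pandas

# first five rows of the trees data shown earlier (Girth and Volume only)
dt = pandas.DataFrame({'Girth': [8.3, 8.6, 8.8, 10.5, 10.7],
                       'Volume': [10.3, 10.3, 10.2, 16.4, 18.8]})

x, y = dt['Girth'], dt['Volume']
# Pearson's r = covariance(x, y) / (std(x) * std(y))
r_manual = x.cov(y) / (x.std() * y.std())
r_pandas = x.corr(y, method='pearson')
print(round(r_manual, 4), round(r_pandas, 4))
```

Both numbers agree, which shows that corr() is not doing anything beyond this formula for the Pearson method.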


Skewness of the data Skewness describes data whose distribution looks roughly normal but is stretched (skewed) to either the left or the right. We need to know the skewness of the data so that we can correct it during preparation. The closer the value is to 0, the less skewed the data is; the further it moves towards -1 or 1 (and beyond), the more it is skewed to the left or right side. Let's check the skewness of our trees data using the skew() function. In [3]:

import pandas
dt = pandas.read_csv('trees.csv')
pandas.set_option('display.precision', 2)
dt.skew()

Out[3]:
Index              0.00
"Girth (in)"       0.55
"Height (ft)"     -0.39
"Volume(ft^3)"     1.12
dtype: float64

As the Index column has evenly spaced values from 1 to 31, its skewness is 0, i.e. it is not skewed at all. On the other hand, Girth is skewed to the right, Height is skewed to the left, and Volume is highly skewed to the right, i.e. beyond 1. During data preparation we must consider the skewness and bring it as close to 0 as possible.
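The sign of the skew() result can be checked on small hand-made series. The sketch below also applies a log transform, which is only one possible way of reducing right skew and is an assumption here, not something the text prescribes; the series values are invented for illustration.

```python
import numpy
import pandas

symmetric = pandas.Series([1, 2, 3, 4, 5])
right_skewed = pandas.Series([1, 1, 1, 2, 10])      # long tail to the right
left_skewed = pandas.Series([-10, -2, -1, -1, -1])  # long tail to the left

print(symmetric.skew(), right_skewed.skew(), left_skewed.skew())

# assumption: log1p as one possible transform to pull in a right tail
logged = numpy.log1p(right_skewed)
print(logged.skew())
```

A positive result means a tail to the right, a negative one a tail to the left, and the transformed series ends up closer to 0 than the original.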

Data Processing Before feeding the data to models we need to pre-process it, because the algorithms depend completely on the data, so it must be as clean and appropriate as possible. While measuring skewness we found that our data is skewed, i.e. its skewness needs to be closer to 0 for better results, so let's look at some processes to prepare our data.

Scaling Our data is spread over a wide range with different scales, which is not suitable for training models. We need to bring our data to a more appropriate scale, and we can do so using the MinMaxScaler class and its fit_transform() method from the scikit-learn library. We can scale our data to the range 0 to 1, which suits many algorithms. In [28]:

import pandas
from sklearn import preprocessing
dt = pandas.read_csv('trees.csv')
ar = dt.values  # array
# Scaler object
Sclr = preprocessing.MinMaxScaler(feature_range=(0,1))
skl_ar = Sclr.fit_transform(ar)  # Scaling
# Scaled data
skl_dt = pandas.DataFrame(skl_ar, columns=['S.No.','Girth','Height','Volume'])
skl_dt.round(1).loc[5:10]

Out[28]:
    S.No.  Girth  Height  Volume
5     0.2    0.2     0.8     0.1
6     0.2    0.2     0.1     0.1
7     0.2    0.2     0.5     0.1
8     0.3    0.2     0.7     0.2
9     0.3    0.2     0.5     0.1
10    0.3    0.2     0.7     0.2

You can compare these values with the unscaled values below. If you want, you can change the range to, say, 0-100 through the feature_range parameter when initializing MinMaxScaler.

    S.No.  Girth  Height  Volume
5       5   10.7      81    18.8
6       6   10.8      83    19.7
7       7   11.0      66    15.6
8       8   11.0      75    18.2
9       9   11.1      80    22.6
10     10   11.2      75    19.9
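Min-max scaling subtracts each column's minimum and divides by the column's range. The sketch below applies that formula by hand and compares it with MinMaxScaler on three Girth, Height and Volume rows taken from the table; the names col_min, col_max and manual are illustrative.

```python
import numpy
from sklearn import preprocessing

# three rows of Girth, Height, Volume from the trees data
ar = numpy.array([[10.8, 83.0, 19.7],
                  [11.0, 66.0, 15.6],
                  [11.0, 75.0, 18.2]])

col_min = ar.min(axis=0)
col_max = ar.max(axis=0)
manual = (ar - col_min) / (col_max - col_min)   # scaled column by column

scaled = preprocessing.MinMaxScaler(feature_range=(0, 1)).fit_transform(ar)
print(manual)
```

Because the minimum and maximum are taken per column, every column ends up spanning exactly 0 to 1 regardless of its original units.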


Normalization Normalization rescales each row of data to have a length (norm) of 1. It is mainly useful for sparse datasets that contain lots of zeros. There are two types of normalization, namely L1 and L2. With the L1 method, the absolute values in each row sum up to 1. We can demonstrate this using the Normalizer class and its transform method. To use the L1 method, specify 'l1' in the norm parameter of the class. In [3]:

import pandas
from sklearn import preprocessing
dt = pandas.read_csv('trees.csv')
ar = dt.values
Nm = preprocessing.Normalizer(norm='l1')
nm_ar = Nm.transform(ar)
print(nm_ar[:5])
# Sum of the rows
for i in [0,1,2,3,4]:
    print(sum(nm_ar[i]))

[[0.01116071 0.09263393 0.78125    0.11495536]
 [0.02328289 0.10011641 0.75669383 0.11990687]
 [0.03529412 0.10352941 0.74117647 0.12      ]
 [0.03887269 0.10204082 0.69970845 0.15937804]
 [0.04329004 0.09264069 0.7012987  0.16277056]]
1.0
1.0
1.0
0.9999999999999999
1.0

We created the Nm object of the Normalizer class with 'l1' as the normalizing method in the norm parameter, normalized our ar data values with the transform method of the Normalizer class and stored the result in the nm_ar variable. Then we printed the first 5 rows of the normalized data. We also used a for loop to print the sum of each of these rows, which is 1 (the 4th row shows 0.9999999999999999 only because of floating-point rounding). Note that we didn't perform any rounding off.
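L1 normalization divides every value in a row by the sum of the absolute values of that row. The sketch below does this by hand on the first two rows of the trees data and compares the result with Normalizer; the name manual is illustrative.

```python
import numpy
from sklearn import preprocessing

# first two rows of the trees data: Index, Girth, Height, Volume
ar = numpy.array([[1, 8.3, 70, 10.3],
                  [2, 8.6, 65, 10.3]])

# divide each row by the sum of its absolute values
manual = ar / numpy.abs(ar).sum(axis=1, keepdims=True)

nm_ar = preprocessing.Normalizer(norm='l1').fit_transform(ar)
print(manual.sum(axis=1))
```

The row sums come out as 1 (up to floating-point rounding), matching the output printed by the loop above.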

In the next method, L2 normalization, the squares of the values in each row sum up to 1. So let's use 'l2' in the norm parameter and check the sums. In [12]:

import pandas
from sklearn import preprocessing
dt = pandas.read_csv('trees.csv')
ar = dt.values
Nm = preprocessing.Normalizer(norm='l2')
nm_ar = Nm.transform(ar)

print('L2 Normalization\n')
print(nm_ar[:5])
# Sum of values in the rows
print('\nSum of the values in each row\n')
for i in [0,1,2]:
    print(sum(nm_ar[i]))
# Sum of the squares of the values in the rows
sm_row = '\nSum of squares of the values in each row\n'
for i in [0,1,2,3]:
    print(sm_row)
    sm_row = 0
    for val in nm_ar[i]:
        sm_row += val*val

L2 Normalization

[[0.01403589 0.11649791 0.98251251 0.1445697 ]
 [0.03012017 0.12951675 0.97890567 0.1551189 ]
 [0.04651593 0.13644674 0.97683459 0.15815417]
 [0.05355175 0.14057333 0.96393143 0.21956216]
 [0.05953254 0.12739964 0.96442719 0.22384236]]

Sum of the values in each row

1.2576160130346932
1.2936614951212533
1.3179514295489714

Sum of squares of the values in each row

1.0
1.0
1.0

The code may be a bit hard to understand because of the for loops, so let's walk through it. We normalized the data as before, but this time with the 'l2' method, and printed the data values. Then we printed the sum of the values in the first three rows of the normalized data, which did not sum up to 1. Next, as the L2 method states, we printed the sum of the squared values in the first three rows using a for loop, which turned out to be exactly 1. Before the for loop, we created the sm_row variable to accumulate the squared values of a row, but initially stored a string in it. The outer loop then iterates over the row indexes. We put one extra index in the list because the first iteration only prints the string held in sm_row; after printing, we reset sm_row to 0 and run the inner loop, which performs the addition. In each iteration of the inner loop we add the square of one element of the row using the += compound assignment operator. Once all the values are summed, we return to the outer loop, print the sum and reset sm_row to 0 to hold the sum for the next row, until all the sums are printed.

[Figure: flow of the loop. sm_row acts as the sum holder (vessel): on the 1st run of the outer loop it holds the heading string, which is printed before sm_row is reset to 0; on the 2nd and later runs print(sm_row) outputs the row sums 1.0, 1.0, ...]
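The same check can be written without nested loops by letting NumPy sum along each row. This is an alternative sketch, not the loop used above; the names row_sums and row_sq_sums are illustrative, and only three rows of the trees data are typed in.

```python
import numpy
from sklearn import preprocessing

# first three rows of the trees data: Index, Girth, Height, Volume
ar = numpy.array([[1, 8.3, 70, 10.3],
                  [2, 8.6, 65, 10.3],
                  [3, 8.8, 63, 10.2]])

nm_ar = preprocessing.Normalizer(norm='l2').fit_transform(ar)

row_sums = nm_ar.sum(axis=1)            # plain sums, not equal to 1
row_sq_sums = (nm_ar ** 2).sum(axis=1)  # sums of squares, equal to 1
print(row_sums)
print(row_sq_sums)
```

The axis=1 argument tells NumPy to add across the columns of each row, which replaces both the inner loop and the sm_row bookkeeping.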

Binarization In binarization we binarize our data, i.e. reduce it to only two crisp values using a threshold. For example, if we set the threshold to 10, all values in the dataset up to 10 are converted to 0 and all values above 10 are converted to 1. Let's binarize our data with the Binarizer class and its transform() method. In [21]:

import pandas
from sklearn import preprocessing
dt = pandas.read_csv('trees.csv')
ar = dt.values
Nm = preprocessing.Normalizer(norm='l1')
nm_ar = Nm.transform(ar)
Bin = preprocessing.Binarizer(threshold=0.1)
bin_ar = Bin.transform(nm_ar)
bin_ar[10:16]

Out[21]:
array([[0., 0., 1., 1.],
       [0., 0., 1., 1.],
       [1., 0., 1., 1.],
       [1., 1., 1., 1.],
       [1., 0., 1., 1.],
       [1., 1., 1., 1.]])

As you can see, we used the L1-normalized data, which has a range of 0 to 1 and so makes it easier to choose a threshold. The threshold is specified in the threshold parameter, here 0.1, so all values up to 0.1 are changed to 0 and all values above 0.1 are changed to 1.
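Binarization amounts to a single comparison against the threshold. The sketch below performs that comparison by hand and checks it against Binarizer; the two input rows are illustrative values in the range of the L1-normalized data, not exact rows from it.

```python
import numpy
from sklearn import preprocessing

# illustrative values resembling L1-normalized rows
nm_ar = numpy.array([[0.0876, 0.0900, 0.6295, 0.1928],
                     [0.1207, 0.1009, 0.5948, 0.1836]])

# values strictly above the threshold become 1, everything else 0
manual = (nm_ar > 0.1).astype(float)

bin_ar = preprocessing.Binarizer(threshold=0.1).fit_transform(nm_ar)
print(bin_ar)
```

Note the strict comparison: a value exactly equal to the threshold is mapped to 0.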


Standardization Standardization, or standard scaling, rescales each data attribute so that its mean becomes 0 and its standard deviation becomes 1. It works best when the attribute already follows a roughly Gaussian (normal) distribution. Let's standardize our data using the StandardScaler class and its fit() and transform() methods. In [14]:

import pandas
import numpy
from sklearn import preprocessing
dt = pandas.read_csv('trees.csv')
ar = dt.values
# Standardizer
Std = preprocessing.StandardScaler().fit(ar)
std_ar = Std.transform(ar)
print(std_ar[0:3])
print('Mean:', round(numpy.mean(std_ar), 2))
print('Std.Deviation:', round(numpy.std(std_ar), 2))

[[-1.67705098 -1.60291968 -0.9572127  -1.22883711]
 [-1.56524758 -1.50574137 -1.75488995 -1.22883711]
 [-1.45344419 -1.44095583 -2.07396086 -1.23502119]]
Mean: -0.0
Std.Deviation: 1.0

When initializing the StandardScaler object we also called the fit() function to fit the scaler to our ar array, and then transformed it. If you don't call fit() first, you'll get an error like:

This StandardScaler instance is not fitted yet. Call 'fit' with appropriate arguments before using this estimator.

You can also use the fit and transform functions with the previous preprocessing methods; they weren't used in the earlier examples for demonstration purposes, but make sure to use them in real applications. Note that we used the mean() and std() functions of the numpy package to calculate the mean and standard deviation, i.e. 0 and 1.
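Standardization subtracts each column's mean and divides by the column's standard deviation. The sketch below does this by hand and compares it with StandardScaler, using fit_transform() to combine the two steps; the four rows are the first Girth, Height and Volume rows of the trees data and the name manual is illustrative.

```python
import numpy
from sklearn import preprocessing

# first four rows of Girth, Height, Volume from the trees data
ar = numpy.array([[8.3, 70.0, 10.3],
                  [8.6, 65.0, 10.3],
                  [8.8, 63.0, 10.2],
                  [10.5, 72.0, 16.4]])

# population standard deviation (ddof=0), taken per column
manual = (ar - ar.mean(axis=0)) / ar.std(axis=0)

std_ar = preprocessing.StandardScaler().fit_transform(ar)
print(std_ar.mean(axis=0).round(2), std_ar.std(axis=0).round(2))
```

Checking the mean and standard deviation per column (axis=0) confirms that each attribute, and not just the array as a whole, has been standardized.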


Label encoding In many cases our data has more labels (words) than features (numbers), but using words (strings) in processing limits many operations. For that reason we need to convert those labels into numeric codes, as in the following example. In [15]:

import pandas
from sklearn import preprocessing
dt = pandas.DataFrame({'Questions':['A','B','C','D','E'],
                       'Answers':['True','True','False','True','False']})
dt

Out[15]:
  Questions Answers
0         A    True
1         B    True
2         C   False
3         D    True
4         E   False

We can use the LabelEncoder class for label encoding. In [17]:

import pandas
from sklearn import preprocessing
dt = pandas.DataFrame({'Questions':['A','B','C','D','E'],
                       'Answers':['True','True','False','True','False']})
Enc = preprocessing.LabelEncoder()
Enc.fit(dt['Answers'])
# Encoded labels
dt['Answers'] = Enc.transform(dt['Answers'])
dt

Out[17]:
  Questions  Answers
0         A        1
1         B        1
2         C        0
3         D        1
4         E        0

As you can see, we had the Questions as A-E and the Answers as True or False, and we encoded the Answers labels as 0 (False) or 1 (True). If we want, we can get the label back for a value, i.e. decode the 0 or 1 values, using the inverse_transform() function. In [18]:

import pandas
from sklearn import preprocessing
dt = pandas.DataFrame({'Questions':['A','B','C','D','E'],
                       'Answers':['True','True','False','True','False']})
Enc = preprocessing.LabelEncoder()
Enc.fit(dt['Answers'])
# Encoded labels
dt['Answers'] = Enc.transform(dt['Answers'])
# Decoding labels
print(Enc.inverse_transform([0,1]))

['False' 'True']

By encoding we replace the original label values with numbers, which lets us perform many more operations on them. In this data we had only two label values, True and False, but when there are more values the codes range from 0 to one less than the number of distinct labels.
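With more than two distinct labels the encoder assigns codes in sorted label order, which can be inspected through its classes_ attribute. The rainfall labels in the sketch below are invented for illustration.

```python
from sklearn import preprocessing

Enc = preprocessing.LabelEncoder()
codes = Enc.fit_transform(['Low', 'Heavy', 'Moderate', 'Heavy'])

print(Enc.classes_)   # distinct labels in sorted order; position = code
print(codes)          # one code per input label
decoded = Enc.inverse_transform([0, 1, 2])
print(decoded)
```

Because the order is alphabetical, the codes carry no meaning such as "low is less than heavy"; they are only identifiers.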

[Figure: part of a decision tree whose nodes split on Age and Gender; each node lists its gini value, number of samples, per-class sample counts (value) and predicted class, e.g. Vanilla, Strawberry or Chocolate]

Likewise, we can use a decision tree to solve different kinds of classification problems. But you may also ask how the tree creates those comparisons or splits. It isn't strictly necessary to know, but it helps. First the algorithm calculates a gini score for each candidate split using the formula p^2 + q^2, which is the sum of the squared probabilities of success (p) and failure (q). The gini value shown in each tree node is the corresponding impurity, 1 minus the sum of the squared class proportions, so a value of 0 means a pure node. The dataset is then split into two lists of rows, given the index of an attribute and a split value for that attribute. Finally, the algorithm finds the best possible split by evaluating the cost (gini) of each split.
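The gini value of a node can be recomputed from its per-class sample counts. The helper below is a hypothetical function written for illustration, not part of scikit-learn, and the counts are of the kind a tree node reports as value.

```python
def gini_impurity(counts):
    """Gini impurity of a node given its per-class sample counts."""
    total = sum(counts)
    # 1 minus the sum of squared class proportions
    return 1.0 - sum((c / total) ** 2 for c in counts)

mixed = gini_impurity([2, 2, 0, 0, 2, 0, 0])   # three classes, equally mixed
partly = gini_impurity([0, 4, 0, 0, 0, 0, 2])  # mostly one class
pure = gini_impurity([2, 0, 0, 0, 0, 0, 0])    # a single class
print(round(mixed, 3), round(partly, 3), pure)
```

The more evenly the classes are mixed in a node, the higher the impurity, which is why the tree prefers splits that push this value towards 0.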


Logistic Regression Logistic regression is a type of model that predicts an outcome as yes or no, encoded as the numeric values 1 or 0 respectively. We can use this type of model to classify a day as rainy or not, a person as healthy or sick, etc. There are different types of logistic regression for different situations.

Binomial Logistic Regression

Binomial or binary logistic regression is used to predict exactly two outcomes, i.e. either 1 (positive) or 0 (negative). Let's use a dataset to predict whether it will rain or not, given the temperature and humidity percentage as input. You can download the dataset from here: https://defmycode.cf/wp-conte. Let's import the modules and the dataset together. This time we will import linear_model and train_test_split from sklearn. In [1]:

import pandas
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

We also imported the accuracy_score() function from sklearn.metrics to calculate the accuracy of our model. Now we can import our dataset, and this time let's view it as it is. In [2]:

import pandas
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
dt = pandas.read_csv('Rainfall_data.csv')
dt

Out[2]:
       Unnamed: 0  Temperature  Humidity%  Rain
0               0           34       74.2   Yes
1               1           19       68.2    No
2               2           28       67.2   Yes
3               3           29       66.6   Yes
4               4           26       57.9   Yes
...           ...          ...        ...   ...
19995       19995           30       77.9   Yes
19996       19996           20       74.8   Yes
19997       19997           14       69.4    No
19998       19998           20       60.6    No
19999       19999           22       64.8    No

20000 rows x 4 columns

As you can see, we have 20000 rows and 4 columns worth of data! Now we can move on to a new cell and split the data into training and testing inputs and outputs. In [3]:

Input = dt.drop(columns=['Unnamed: 0','Rain'])
Output = dt['Rain']
inp_X, tst_X, out_y, tst_y = train_test_split(
    Input.values, Output, test_size=0.01)

We stored the input features, i.e. Temperature and Humidity%, in Input and the output, i.e. Rain (Yes or No), in Output. Then we passed these values to the train_test_split() function, which split the data into training input, testing input, training output and testing output, where the test size is 0.01 (1%, i.e. 200 rows).

Now we can create our logistic regression model, CModel, and train it. In [4]:

Input = dt.drop(columns=['Unnamed: 0','Rain'])
Output = dt['Rain']
inp_X, tst_X, out_y, tst_y = train_test_split(
    Input.values, Output, test_size=0.01)
CModel = linear_model.LogisticRegression()
CModel.fit(inp_X, out_y)

Out[4]:
LogisticRegression()

So our model is ready to make predictions. Let's move on to a new cell and let the model predict. Then we will compare the values and print the accuracy score. In [5]:

from sklearn import preprocessing
pred_y = CModel.predict(tst_X)
Enc = preprocessing.LabelEncoder().fit(['Yes','No'])
cmp = pandas.DataFrame({'Predicted':Enc.transform(pred_y),
                        'Actual':Enc.transform(tst_y)})
print('Accuracy Score:', accuracy_score(tst_y,pred_y))
cmp.plot(kind='density')

Accuracy Score: 0.91
Out[5]: [Figure: density plot of the Predicted and Actual values]

So the model has an accuracy score of 0.91, i.e. 91%, which is really good! You can also see in the density plot that only about 9% of the values are predicted wrongly by the model.
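Besides the accuracy score, a confusion matrix shows which kind of mistakes the model makes. The sketch below uses synthetic stand-in data, because Rainfall_data.csv is not reproduced here; the rule humidity > 68 and the random_state and max_iter values are assumptions made only so that the example runs on its own.

```python
import numpy
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

# synthetic stand-in for Rainfall_data.csv (assumption, not the real data)
rng = numpy.random.default_rng(0)
temperature = rng.integers(10, 40, size=500)
humidity = rng.uniform(50, 90, size=500)
X = numpy.column_stack([temperature, humidity])
y = numpy.where(humidity > 68, 'Yes', 'No')

inp_X, tst_X, out_y, tst_y = train_test_split(
    X, y, test_size=0.2, random_state=0)
CModel = linear_model.LogisticRegression(max_iter=1000).fit(inp_X, out_y)
pred_y = CModel.predict(tst_X)

acc = accuracy_score(tst_y, pred_y)
# rows = actual class, columns = predicted class, in the order of labels
cm = confusion_matrix(tst_y, pred_y, labels=['No', 'Yes'])
print('Accuracy Score:', acc)
print(cm)
```

The diagonal of the matrix counts the correct predictions, so dividing its sum by the total reproduces the accuracy score.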


So how did our model predict the values, or how does logistic regression work? To understand this we will look at the mathematics behind the algorithm; if you want you can skip ahead, or give it a read. The following are the steps of binomial logistic regression:
• We already know that the output will be either 0 (No) or 1 (Yes). For that, the linear function is used as an input to another function g, as in the following relation:

$$h_\theta(x) = g(\theta^T x), \qquad 0 \le h_\theta(x) \le 1$$

g is the logistic or sigmoid function, which is given by the following formula:

$$g(z) = \frac{1}{1 + e^{-z}}$$

where z is $\theta^T x$
• The sigmoid curve is an S-shaped graph that rises from 0 to 1, so the classes can be divided into positive or negative. The output is read as the probability of the positive class, since it lies between 0 and 1. For our implementation, we interpret the output of the hypothesis function as positive if it is greater than or equal to 0.5 (≥ 0.5), otherwise negative
• We also need to define a loss function to measure how well the algorithm performs using the weights, represented by $\theta$, where h is equal to $g(X\theta)$ and m is the number of training samples:

$$J(\theta) = \frac{1}{m}\left(-y^T \log(h) - (1 - y)^T \log(1 - h)\right)$$

After defining the loss function our prime goal is to minimize it
• This is done by fitting the weights, which means increasing or decreasing them. With the help of the derivatives of the loss function with respect to each weight, we know which parameters should have a higher weight and which should have a smaller weight. The following gradient descent equation tells us how the loss would change if we modified the parameters:

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} X^T \left(g(X\theta) - y\right)$$
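The steps above can be sketched in plain Python to see how the sigmoid, the loss and the gradient fit together. This is only an illustration under stated assumptions: the toy data, the learning rate of 0.5 and the 200 steps are made up, and sklearn's LogisticRegression uses a different solver with regularization, so its weights would not match these.

```python
import math

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(theta, row):
    """Hypothesis h = g(theta^T x) for one row of features."""
    return sigmoid(sum(t * x for t, x in zip(theta, row)))

def loss(theta, X, y):
    """Log loss J(theta), averaged over the m samples."""
    m = len(X)
    total = 0.0
    for row, label in zip(X, y):
        h = predict_proba(theta, row)
        total += -label * math.log(h) - (1 - label) * math.log(1 - h)
    return total / m

def gradient(theta, X, y):
    """Gradient (1/m) * X^T (g(X theta) - y), one entry per weight."""
    m = len(X)
    errors = [predict_proba(theta, row) - label for row, label in zip(X, y)]
    return [sum(e * row[j] for e, row in zip(errors, X)) / m
            for j in range(len(theta))]

def fit(X, y, rate=0.5, steps=200):
    """Plain gradient descent: move each weight against its gradient."""
    theta = [0.0] * len(X[0])
    for _ in range(steps):
        grad = gradient(theta, X, y)
        theta = [t - rate * g for t, g in zip(theta, grad)]
    return theta

# Hypothetical toy data: the first column is a constant 1 (bias), the second is the feature
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 1]

theta = fit(X, y)
labels = [1 if predict_proba(theta, row) >= 0.5 else 0 for row in X]
print(labels)
print(loss([0.0, 0.0], X, y), '->', loss(theta, X, y))   # loss before and after fitting
```

The loss printed after fitting should be smaller than the loss for all-zero weights, which is exactly what minimizing the loss function means.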

Multinomial Logistic Regression

As the name suggests, this time the output can take more than two values. In multinomial logistic regression we classify into three or more categories; the categories can be nominal, i.e. just different types like Rain, Hailstorm, Snow, etc., or ordinal like heavy rain, moderate rain or low rainfall.
Let's consider the previous situation where we predicted whether it will rain or not, and create a model to predict whether the rainfall will be heavy, moderate or low. You can download the dataset from https://defmycode.cf/wp-content/uplo (check the Resources) and import the modules as we did while creating the model to predict rain

In [1]:

import pandas
from sklearn import linear_model, metrics
from sklearn.model_selection import train_test_split


Now we can import our data and preview it without the head() function In [2]:

import pandas
from sklearn import linear_model, metrics
from sklearn.model_selection import train_test_split
dt = pandas.read_csv('RainfallData.csv')
dt

       Temperature  Humidity%  Rainfall
0               34       74.2       Low
1               19       68.2   No Rain
2               28       67.2  Moderate
3               29       66.6  Moderate
4               26       57.9       Low
...            ...        ...       ...
17996           31       89.7   No Rain
17997           21       84.7   No Rain
17998           28       74.7   No Rain
17999           30       78.2   No Rain
18000           34       80.4       Low

18001 rows x 3 columns

We have the same Temperature and Humidity percent columns, but the rain is classified into No Rain, Low, Moderate and High. Now we can move on to the next step, i.e. splitting the data In [3]:

Input = dt.drop(columns='Rainfall')
Output = dt.drop(columns=['Temperature', 'Humidity%'])
inp_X, tst_X, out_y, tst_y = train_test_split(
    Input, Output, test_size=0.1)



Next we should scale our Input data. This step is optional, but without it the model may fail to converge and report a warning or an error. We will import the preprocessing module and scale our input data. Then we can split our data into training and testing sets again, create our model and train it
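As a rough picture of what scaling does, the sketch below standardizes each column to zero mean and unit variance, which is the idea behind preprocessing.scale(). The helper name scale_column is an assumption made for this example, and the values are the first five rows of the dataset preview above.

```python
from statistics import mean, pstdev

def scale_column(values):
    """Standardize one column: subtract the mean, divide by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]

# First five rows of the Temperature and Humidity% columns
temperature = [34, 19, 28, 29, 26]
humidity = [74.2, 68.2, 67.2, 66.6, 57.9]

scaled_temperature = scale_column(temperature)
scaled_humidity = scale_column(humidity)
print([round(v, 2) for v in scaled_temperature])
print([round(v, 2) for v in scaled_humidity])
```

After scaling, both columns live on the same scale, so neither feature dominates the weights simply because its numbers are larger.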


In [4]:

from sklearn import preprocessing
Input = preprocessing.scale(dt.drop(columns='Rainfall').values)
Output = dt['Rainfall']
inp_X, tst_X, out_y, tst_y = train_test_split(
    Input, Output, test_size=0.2)
CModel = linear_model.LogisticRegression()
CModel.fit(inp_X, out_y)

Out[4]:

LogisticRegression()

Our model is trained. Now we can test our model's predictions against the actual values. To visualize this we need to use the LabelEncoder and encode the Rainfall labels into numeric values. We will also print the accuracy of our model In [5]:

pred_y = CModel.predict(tst_X)
Enc = preprocessing.LabelEncoder().fit(['No Rain', 'Low', 'Moderate', 'High'])
cmp = pandas.DataFrame({'Predicted': Enc.transform(pred_y),
                        'Actual': Enc.transform(tst_y)})
acc = metrics.accuracy_score(tst_y, pred_y)
print('Accuracy:', acc)
cmp.plot(kind='density')

Accuracy: 0.435156900860872

Out[5]:

1], s=300, linewidth=1, facecolors='none')
ax.set_xlim(xlim)
ax.set_ylim(ylim)

First of all we take the model, ax (axes) and plot_support (whether to plot the support vectors or not) as parameters. Then we start off with the 2-D graph plot, and if we aren't passed the axes we get them using the gca() function. We also find the x-axis and y-axis limits using the get_xlim() and get_ylim() functions respectively and store them in xlim and ylim

Then we create the grid, i.e. the base of our plot, using xlim and ylim. As we did before, we create values for the lines using the linspace() function, create the grid using the meshgrid() function, and then use the vstack() function to vertically stack the arrays, where the values are flattened using the ravel() function. We also call decision_function() to get the values for the boundaries and margins.
Next we use the data to plot the boundaries and margins using the contour() function to draw the lines, and specify the linestyles and the other properties using the respective parameters.
At last, we check whether to plot the support vectors or not, and if so plot them with the scatter() function using the columns with index 0 and 1 of the support_vectors_ array.
Finally we can plot our data clusters, call the MMH() function and pass our SVC model In [41]:

pyplot.scatter(X1, X2, c=y)
MMH(CModel)

Finally we have the maximum marginal hyperplane plotted for our data clusters with the support vectors


Support Vector Machine Kernels
Support vector machines are implemented with kernels that transform the input data space into a higher-dimensional space, which gives the support vector machine more flexibility. In the previous model we used the linear kernel, but there are different types of kernels:
• Linear kernel, used when the classes can be separated by a straight line (or hyperplane)
• Polynomial kernel, a more generalized version of the linear kernel that can handle a non-linear input space
• Radial Basis Function (RBF) kernel, which maps the input space into infinite dimensions
This time we will use a sample dataset provided by sklearn to understand the different kernels. First of all we need to import our data and prepare it. Import the following In [1]:

import numpy
from sklearn import svm, datasets
from matplotlib import pyplot

We load the iris dataset (a sample dataset of iris flowers) from the datasets module In [2]:

dt = datasets.load_iris()
X = dt.data[:, :2]
y = dt.target

We loaded the dataset and split it into the input data[:, :2] (the first two columns, i.e. indexes 0 and 1, of every row) and the output target, and stored them in X and y respectively


Now we need the data to plot the SVM boundaries, i.e. the input data spaces of the different classes. To do so we need to create a grid as we did before. To build the grid we need the minimum and maximum values of the two input features. Then we flatten the grid arrays using ravel() and pass them to numpy.c_, which stacks arrays along their last axis after upgrading them to at least 2-D, and store the result in X_plot as the testing input In [3]:

x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
h = (x_max / x_min) / 100
xx, yy = numpy.meshgrid(numpy.arange(x_min, x_max, h),
                        numpy.arange(y_min, y_max, h))
X_plot = numpy.c_[xx.ravel(), yy.ravel()]

We have the required data to train the SVC classifier, so let's create it In [4]:

SvcModel = svm.SVC(kernel='linear',C=1.0).fit(X, y)

We create the SvcModel using the SVC() function, passing linear to the kernel parameter and 1.0 (float) to the regularization parameter C. We also train it using the fit() method.
Now we can predict the X_plot values and store them in Z. We will reshape Z using the reshape() function to the shape of the xx meshgrid.
First of all we plot the figure (base). Then we add a subplot using subplot(121) and draw the filled contours using the contourf() function, passing the values created with the meshgrid() function and the predicted values Z. Now we can plot the data clusters using the scatter() plot and finally limit the x-axis to the minimum and maximum values of the xx meshgrid.
We can see how our dataset is divided into different spaces by the support vector classifier with the linear kernel

In [5]:

Z = SvcModel.predict(X_plot)
Z = Z.reshape(xx.shape)
pyplot.figure(figsize=(15, 5))
pyplot.subplot(121)
pyplot.contourf(xx, yy, Z, alpha=0.3)
pyplot.scatter(X[:, 0], X[:, 1], c=y)
pyplot.xlim(xx.min(), xx.max())

Out[5]:

(3.3, 8.882727272727251)

Similarly we can create a SVC using the Radial Basis Function kernel. In [10]:

RbfSvc = svm.SVC(kernel='rbf', C=1.0).fit(X, y)
Z = RbfSvc.predict(X_plot)
Z = Z.reshape(xx.shape)
pyplot.figure(figsize=(15, 5))
pyplot.subplot(121)
pyplot.contourf(xx, yy, Z, alpha=0.3)
pyplot.scatter(X[:, 0], X[:, 1], c=y)
pyplot.xlim(xx.min(), xx.max())

(Output on the next page) You can compare the plots produced with the linear and rbf kernels and notice a clear difference: the linear kernel separates the classes with straight lines and the rbf kernel with curves

Out[10]:

(3.3, 8.882727272727251)
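To see why the two kernels draw such different boundaries, it can help to compute the kernel values by hand. The sketch below is an illustration only: the function names, the gamma value of 0.5 and the three sample points are assumptions, and sklearn's SVC picks gamma differently by default.

```python
import math

def linear_kernel(a, b):
    """Linear kernel: the plain dot product of the two points."""
    return sum(x * y for x, y in zip(a, b))

def rbf_kernel(a, b, gamma=0.5):
    """RBF kernel: exp(-gamma * squared distance), a similarity between 0 and 1."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * squared_distance)

# Hypothetical points with two features each, like the two iris columns used above
p = [5.1, 3.5]
q = [4.9, 3.0]   # close to p
r = [7.0, 3.2]   # far from p

print(linear_kernel(p, q), linear_kernel(p, r))
print(rbf_kernel(p, q), rbf_kernel(p, r))
```

The linear kernel only grows with the size of the features, while the RBF kernel falls towards 0 as two points move apart. That locality is what lets the RBF classifier bend its boundary around nearby groups of points.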


18 CLUSTERING ALGORITHM
• Clustering
• K-Means algorithm
• Mean shift algorithm
• Hierarchical clustering

Clustering is a case of unsupervised machine learning. Clustering algorithms learn relations in the data and divide it into groups; depending on the algorithm, the number of groups is either provided as input or found by the algorithm itself

Clustering
The following are the different types of clustering:
• Density-based, where clusters are formed as dense regions. These algorithms have good accuracy and the capability to merge two clusters together. Examples are Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and Ordering Points To Identify the Clustering Structure (OPTICS)
• Hierarchical-based, where clusters are formed in a hierarchical tree, which can be Agglomerative (bottom-up approach) or Divisive (top-down approach). Examples are Clustering Using Representatives (CURE) and Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH)
• Partitioning, where clusters are formed by partitioning the objects into k clusters; the number of clusters is equal to the number of partitions. An example is K-Means
• Grid, where clusters are formed as a grid. This method is fast and independent of the number of objects. Examples are Statistical Information Grid (STING) and Clustering in Quest (CLIQUE)


Until now we have calculated the accuracy of supervised learning algorithms from the predicted values and the actual values, but how can we do so for unsupervised learning algorithms, where we are dealing with unlabeled data? There are some metrics that can be used to evaluate the performance or quality of different unsupervised learning algorithms from the clusters they form.
Silhouette analysis is used to check the quality of a clustering model by measuring the distance between the clusters. It gives us a way to assess parameters like the number of clusters with the help of the silhouette score. The silhouette score ranges from -1 to 1. The different values represent the following:
• 1 means the cluster is far away from its neighbouring cluster
• 0 means the cluster is very close to, or on, the decision boundary separating the clusters
• -1 means the clusters aren't formed correctly
The silhouette score can be calculated using the following formula:

$$\text{Silhouette Score} = \frac{p - q}{\max(p, q)}$$

where p is the mean distance to the points in the nearest cluster and q is the mean intra-cluster distance to all the points.
Next we can use the Davies-Bouldin index to know whether the clusters are well spaced from each other and how dense they are. We can calculate the DB index using the following formula:

$$DB = \frac{1}{n} \sum_{i=1}^{n} \max_{j \ne i} \left( \frac{\sigma_i + \sigma_j}{d(c_i, c_j)} \right)$$

where n is the number of clusters, $\sigma_i$ is the average distance of all points in cluster i from the cluster centroid $c_i$, and $d(c_i, c_j)$ is the distance between the centroids of clusters i and j. Lower values indicate good performance, where 0 is the minimum value
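The silhouette formula can be tried out on a handful of points. The sketch below is a minimal illustration in plain Python, assuming one-dimensional points, absolute difference as the distance and at least two points per cluster; the function names and the two toy clusterings are made up for the example.

```python
def mean_distance(point, others):
    """Average absolute distance from one point to a list of points."""
    return sum(abs(point - o) for o in others) / len(others)

def silhouette_score(clusters):
    """Mean of (p - q) / max(p, q) over all points.

    q is the mean intra-cluster distance and p is the mean distance
    to the points of the nearest other cluster.
    """
    scores = []
    for i, cluster in enumerate(clusters):
        for k, point in enumerate(cluster):
            same = cluster[:k] + cluster[k + 1:]       # the other points of the same cluster
            q = mean_distance(point, same)
            p = min(mean_distance(point, other)
                    for j, other in enumerate(clusters) if j != i)
            scores.append((p - q) / max(p, q))
    return sum(scores) / len(scores)

# Two hypothetical clusterings of the same four points
well_separated = [[0.0, 1.0], [10.0, 11.0]]
badly_formed = [[0.0, 10.0], [1.0, 11.0]]

print(silhouette_score(well_separated))
print(silhouette_score(badly_formed))
```

The first clustering scores close to 1 because each point is much nearer to its own cluster than to the other one, while the second scores below 0 because every point has been grouped with a far-away point.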

The Dunn index is another metric that can be used to evaluate the performance of a clustering algorithm. It is similar to the DB index, but the differences are:
• It considers only the clusters close together, whereas the DB index considers all of the clusters
• A lower Dunn index indicates bad performance, whereas the lower the DB index, the higher the performance of the algorithm
The Dunn index can be calculated using the following formula:

$$D = \frac{\min_{1 \le i < j \le n} p(i, j)}{\max_{1 \le k \le n} q(k)}$$

where p(i, j) is the inter-cluster distance between clusters i and j and q(k) is the intra-cluster distance of cluster k

out_y)

Out[5]:

KNeighborsClassifier()

Now we can pass the test values to the KNN classifier and print the accuracy score In [6]:

pred_y = KNN.predict(tst_X)
acc = accuracy_score(tst_y, pred_y)
print('Accuracy:', acc)

Accuracy: 0.9335

Our model has an accuracy of 93%, which is about 2% more than the logistic regression model. So this is how we can use the KNN algorithm to solve problems of regression and classification alike

20 PERFORMANCE & METRICS
• Calculating the model
• Improving the model
• Saving and loading models

So far we have created a lot of models with different algorithms for different tasks like regression, classification, etc. and also evaluated their performance visually through graphs or their accuracy score. In this lesson we will look at the methods to calculate the performance of algorithms

Calculating the model
In the maths for machine learning lesson we learned about some methods to calculate the error rate, Precision, Recall and F-measure (page no. 71) using the confusion matrix. All these values can be used to evaluate the performance of a classifier. Let's use the KNN classifier model we created previously and calculate its performance.
First of all let's print the confusion matrix using the confusion_matrix() function In [8]:

from sklearn import metrics
cm = metrics.confusion_matrix(tst_y, pred_y)
cm

Out[8]:

array([[ 657, 62], [ 80, 1201]], dtype=int64)

We passed the actual values followed by predicted values. We can visualize it better like In [11]:

from sklearn import metrics
cm = metrics.confusion_matrix(tst_y, pred_y)
# Rows of cm are the actual classes and columns are the predicted classes (label order 0, 1)
cf = pandas.DataFrame(cm, index=['Actual -ve', 'Actual +ve'],
                      columns=['Predicted -ve', 'Predicted +ve'])
cf

Out[11]:
            Predicted -ve  Predicted +ve
Actual -ve            657             62
Actual +ve             80           1201

So we have 657 True Negatives (values predicted Negative {0, 'No', etc.} that are actually Negative), 62 False Positives (values predicted Positive {1, 'Yes', etc.} that are actually Negative), 80 False Negatives (values predicted Negative that are actually Positive) and 1201 True Positives (values predicted Positive that are actually Positive). Using the confusion matrix we can calculate other metrics like: In [14]:

from sklearn import metrics
cm = metrics.confusion_matrix(tst_y, pred_y)
cf = pandas.DataFrame(cm, index=['Actual -ve', 'Actual +ve'],
                      columns=['Predicted -ve', 'Predicted +ve'])
val = (tst_y, pred_y)
acc = metrics.accuracy_score(*val)
pre = metrics.precision_score(*val)
rcl = metrics.recall_score(*val)
fms = metrics.f1_score(*val)
print('Accuracy:', acc)
print('Precision:', pre)
print('Recall:', rcl)
print('F-Measure:', fms)

Accuracy: 0.929
Precision: 0.950910530482977
Recall: 0.9375487900078064
F-Measure: 0.944182389937107
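The printed values can be reproduced by hand from the four counts of the confusion matrix. The sketch below assumes the matrix is read in sklearn's layout, where rows are the actual classes and columns are the predicted classes, so that with label 1 as the positive class the counts are 657 true negatives, 62 false positives, 80 false negatives and 1201 true positives. The variable names are chosen for this example.

```python
# Counts read from the confusion matrix [[657, 62], [80, 1201]]
tn, fp, fn, tp = 657, 62, 80, 1201

accuracy = (tp + tn) / (tp + tn + fp + fn)      # share of all predictions that are correct
precision = tp / (tp + fp)                      # share of positive predictions that are correct
recall = tp / (tp + fn)                         # share of actual positives that were found
f_measure = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

print('Accuracy:', accuracy)
print('Precision:', precision)
print('Recall:', recall)
print('F-Measure:', f_measure)
```

Working the numbers out this way is a quick check on which cell of the matrix is which: swapping the true positives and true negatives would leave the accuracy unchanged but give a different precision and recall.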

We have calculated the accuracy, precision (the share of the model's positive predictions that are truly positive), recall (the share of the actual positive values that the model predicted as positive) and F-measure (also known as the F1 score) using the accuracy_score(), precision_score(), recall_score() and f1_score() functions respectively and printed them. We can print all of them together in tabular form using the classification_report() function

In [17]:

from sklearn import metrics
rep = metrics.classification_report(tst_y, pred_y)
print(rep)

              precision    recall  f1-score   support

           0       0.89      0.91      0.90       719
           1       0.95      0.94      0.94      1281

    accuracy                           0.93      2000
   macro avg       0.92      0.93      0.92      2000
weighted avg       0.93      0.93      0.93      2000

The support is the number of actual samples of each label in the test set, here 0 and 1, i.e. Rain or No Rain. The macro average is the unweighted mean of the per-class scores (0.5 * score of class 0 + 0.5 * score of class 1), and the weighted average weights each class score by its support (here 719/2000 for class 0 and 1281/2000 for class 1), which matters when the classes are imbalanced. We can use these metrics to evaluate the performance of an algorithm. Now let's look at the metrics to evaluate a regression model. We will use the KNN regressor we created In [6]:

from sklearn import metrics
val = (tst_y, pred_y)
err = metrics.max_error(*val)
mae = metrics.mean_absolute_error(*val)
mse = metrics.mean_squared_error(*val)
rsq = metrics.r2_score(*val)
print('Max Error:', err)
print('MAE:', mae)
print('MSE:', mse)
print('R2:', rsq)

Max Error: 19.824399999999983
MAE: 7.167660000000001
MSE: 84.34250813599998
R2: 0.02620416439244655

We calculated the Maximum Error (maximum residual error), MAE (mean absolute error, i.e. the average vertical distance between each point and the regression line), MSE (mean squared error, i.e. the mean of the squared distances from each point to the regression line) and R2 (Explained variation / Total variation) using the max_error(), mean_absolute_error(), mean_squared_error() and r2_score() functions respectively and printed them. The lower the Max Error, MAE and MSE are, the better the performance of the model. R2 is a proportion, so the closer it is to 1.0 the better. A constant model that always predicts the expected value of y, disregarding the input features, would get an R2 score of 0.0, and a model like ours with a score of about 0.03 is barely better than that
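The four regression metrics are simple enough to compute by hand, which also shows why a constant model gets an R2 of 0. The sketch below is an illustration with made-up numbers; the function name regression_metrics and the toy values are assumptions, not the weight data used above.

```python
def regression_metrics(actual, predicted):
    """Return (max error, MAE, MSE, R2) for two equal-length lists."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    n = len(actual)
    max_err = max(abs(r) for r in residuals)
    mae = sum(abs(r) for r in residuals) / n
    mse = sum(r * r for r in residuals) / n
    mean_actual = sum(actual) / n
    total_variation = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - sum(r * r for r in residuals) / total_variation   # 1 - unexplained / total
    return max_err, mae, mse, r2

# Hypothetical actual values and two sets of predictions
actual = [3.0, 5.0, 7.0, 9.0]
model_guess = [2.0, 5.0, 8.0, 12.0]
constant_guess = [6.0] * 4          # always predicts the mean of the actual values

print(regression_metrics(actual, model_guess))
print(regression_metrics(actual, constant_guess))
```

For the constant guess the unexplained variation equals the total variation, so R2 comes out as 0; any model that uses the inputs sensibly should score above that.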

Improving the model
Upon calculating the metrics of an algorithm we can perform the following steps to improve the performance of our models:
• Make sure to train the model with adequate data. The dataset shouldn't have an abnormal distribution of features or labels, like 5 samples of Yes and 95 samples of No
• After loading the data we should always apply the most suitable preprocessing methods, like encoding labels, to improve its quality
• We shouldn't hold back too much data for testing, but not too little either. For datasets with over 10k samples, 20% or less is adequate
• When there is very little data, you can create values for testing yourself instead of splitting the already scarce data
• You can always test different algorithms on a problem, compare their metrics, choose the best one and improve it


Saving and loading models
So we have created our model, tested it and even improved it. Let's say we want to use the model somewhere else or share it; how do we do that? Well, we can do so using the joblib module. So let's save our KNN weight-predicting model using joblib In [6]:

import joblib
joblib.dump(KNN, "WeightPred.sav")

Out[6]:

['WeightPred.sav']

We used the dump() function and passed the KNN regressor and the "WeightPred.sav" filename as arguments. We use the .sav extension after the model name here (joblib itself accepts any file name). As we haven't specified any location, the file is stored in the folder where the jupyter notebook is hosted, alongside files such as sales_data.csv, somefile.png, trees.csv and WeightPred.sav

Now we can open a new jupyter notebook, import joblib and load our model In [1]:

import joblib
KNN = joblib.load("WeightPred.sav")
KNN.predict([[70]])

Out[1]:

array([[133.4852]])

We loaded our model using the load() function and passed the saved model's file name. We also asked the model to predict the weight of a person with a height of 70 inches, and it predicted about 133.5 pounds
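The same save-then-load round trip can be tried without a trained model. The sketch below swaps in the standard library's pickle module in place of joblib, and a plain dictionary with made-up slope and intercept values stands in for the model; both are assumptions for illustration, so this shows the idea of dumping to a file and loading it back rather than joblib itself.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a trained model: a straight line weight = slope * height + intercept
model = {'slope': 3.1, 'intercept': -84.0}

def predict(model, height):
    """Predict a weight from a height with the stand-in model."""
    return model['slope'] * height + model['intercept']

# Save the model to a file in a temporary folder
path = os.path.join(tempfile.mkdtemp(), 'WeightPred.sav')
with open(path, 'wb') as f:
    pickle.dump(model, f)

# Load it back, as we would in a new notebook
with open(path, 'rb') as f:
    loaded = pickle.load(f)

print(predict(loaded, 70) == predict(model, 70))   # the loaded copy predicts the same value
```

Whichever module is used, only load files that come from a source you trust, because loading a pickled file can run code stored inside it.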

ML APPLICATION 1
• Movie Recommender

Problem: You have to create a model that suggests a movie genre a person is likely to enjoy, given the person's age, gender and the genre of the movie they previously watched. Here is the sample dataset: https://defmycode.cf/wp-content/uploads/2020/12/movies.csv

So the first step is to decide which method to use. The task is to recommend a genre, i.e. to classify a person, so we will use classification. Next, we need to decide which algorithm to use. We aren't dealing with a huge dataset, so we can go with the decision tree classifier. So let's start off by importing all the modules we need In [1]:

import pandas
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder

Now we can import the dataset and preview it using the head() function In [2]:

import pandas
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import LabelEncoder
dt = pandas.read_csv('movies.csv')
dt.head(3)

Out[2]:
   Age  Gender  Watched    Genre
0   19    Male   Comedy  Mystery
1   19  Female  Romance    Drama
2   19    Male  Romance    Drama

As you can see we need to encode all of the labels into numeric values. We can use an encoder for the Gender labels and another encoder for Watched and Genre labels In [3]:

# Gender Encoder
gndr_enc = LabelEncoder()
gndr_enc.fit(['Male', 'Female'])

Out[3]:

LabelEncoder()

We created the gndr_enc Gender encoder and passed the Gender labels to the fit() method. Now we can create the Genre encoder. But before that we need all the unique Genre labels in both Watched and Genre column In [4]:

# Unique Genre Label extraction
watched = dt['Watched'].unique()
genre = dt['Genre'].unique()
Genres = [*watched]
for ele in genre:
    if ele in Genres:
        continue
    else:
        Genres.append(ele)

First of all we extracted the unique labels from the Watched and Genre columns using the unique() method. Then we created another list variable and passed the Watched uniques (note that we need a single, i.e. 1-D, list; that's why the watched list is unpacked with the * operator). Using the for loop we added the unique labels of the Genre column that aren't already present in the Genres list


Now we can create our gnre_enc Genre encoder and fit the Genres In [5]:

# Unique Genre Label extraction
watched = dt['Watched'].unique()
genre = dt['Genre'].unique()
Genres = [*watched]
for ele in genre:
    if ele in Genres:
        continue
    else:
        Genres.append(ele)
# Genre Encoder
gnre_enc = LabelEncoder()
gnre_enc.fit(Genres)

Out[5]:

LabelEncoder()
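To see what the encoder does with the labels, here is a minimal stand-in written in plain Python. The class SimpleLabelEncoder and the two short label lists are assumptions for illustration; like sklearn's LabelEncoder it numbers the labels in sorted order, but it is not the library's implementation.

```python
class SimpleLabelEncoder:
    """Tiny stand-in for a label encoder: sorted unique labels mapped to 0, 1, 2, ..."""

    def fit(self, labels):
        self.classes_ = sorted(set(labels))
        self._index = {label: i for i, label in enumerate(self.classes_)}
        return self

    def transform(self, labels):
        return [self._index[label] for label in labels]

    def inverse_transform(self, codes):
        return [self.classes_[code] for code in codes]

# Hypothetical unique labels of the Watched and Genre columns
watched = ['Comedy', 'Romance', 'Horror']
genre = ['Mystery', 'Drama', 'Fantasy', 'Romance']

# Order-preserving union, same result as the for loop above
Genres = list(dict.fromkeys(watched + genre))

enc = SimpleLabelEncoder().fit(Genres)
for g in Genres:
    print(g, '-', *enc.transform([g]))
print(enc.inverse_transform([4]))
```

Because the mapping only knows the original labels, passing it values that are already numbers fails with a lookup error, which is the same kind of problem you hit when an encoding cell is run twice on the same column.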

All the encoders are ready so let's encode the labels in our dataset with them In [6]:

for col in ['Gender', 'Watched', 'Genre']:
    if col == 'Gender':
        # Gender Encode
        dt[col] = gndr_enc.transform(dt[col])
    else:
        # Watched & Genre Encode
        dt[col] = gnre_enc.transform(dt[col])

Now we can divide our dataset into input and output. Let's move on to a new cell, because re-running the cell above would cause an error: the labels are already encoded, so the encoders would receive numbers instead of labels In [7]:

X = dt.drop(columns='Genre')
y = dt['Genre']
CModel = DecisionTreeClassifier()
CModel.fit(X, y)

Out[7]:

DecisionTreeClassifier()




We have also trained our CModel and can use it to make predictions. So let's create a function that takes the values and returns the Genre label In [8]:

def recommend(age=18, gnd=0, watched=0, test=False):
    # Get the input interactively when testing
    if test:
        age = int(input("Age:"))
        gnd = int(input("Gender:"))
        for g in Genres:
            print(g, '-', *gnre_enc.transform([g]))
        watched = int(input("Watched:"))
    # Ask the model for a recommendation
    pred = CModel.predict([[age, gnd, watched]])
    # Decode the prediction into a label
    rec = gnre_enc.inverse_transform(pred)
    return rec[0]

So we defined a recommend() function with four parameters: age, by default 18; gnd, the gender, by default 0 (Female); watched, the genre of the previously watched movie, by default 0 (Comedy); and test, by default False, which we can use during testing to type in the input values.
If we pass test as True, the function will ask for our input and also display the encoded values for each genre. Then the model will predict using the input values. We take the output (an encoded value), decode it and finally return it.
So let's move on to a new cell, call our recommend() function and pass True for the test parameter In [*]:

recommend(test=True)
Age:18

You can see we are shown the input prompt called in the recommend() function. So let's pass 18 as the age

In [*]:
recommend(test=True)
Age:18
Gender:1

Pass the Gender as 1 (Male) In [*]:

recommend(test=True)
Age:18
Gender:1
Comedy - 0
Romance - 5
Horror - 3
Mystery - 4
Drama - 1
Fantasy - 2
Watched:0

You can see the function has displayed the encoded values for each genre, so let's pass 0 (Comedy) In [9]:

recommend(test=True)

Age:18
Gender:1
Comedy - 0
Romance - 5
Horror - 3
Mystery - 4
Drama - 1
Fantasy - 2
Watched:0
Out[9]:

'Mystery'

Now we get Mystery as the recommendation for the 18-year-old male who has previously watched a comedy movie. Because we have very little data, let's check the answer visually using the dataset



Out[2]:
   Age  Gender  Watched    Genre
0   19    Male   Comedy  Mystery
1   19  Female  Romance    Drama
2   19    Male  Romance    Drama

In the second cell we printed the first three rows of our dataset, and by looking at them we can say that if an 18-year-old male (whose sample isn't present in the dataset) has previously watched a Comedy movie, most likely he'll like a Mystery movie too, along with Comedy movies.
So we have created our Movie Recommender Model using a very small dataset. Now it's up to you to test the model, or even ask your relatives for their details and recommend movies to them using the model!

ML APPLICATION 2
• Advertisement handling

Problem: You have to create a model to decide whether to show an ad to a user or not and, if yes, which one: the Car or the Insurance advertisement. The age of the user and the user class, i.e. a group assigned by another model based on the user's past search results, are provided as input. You are provided with the following dataset: https://defmycode.cf/wp-content/uploads/2020/12/advertisement.csv

Once again we need to decide which method to use and clearly this is a problem of classification. We can use the KNN classifier algorithm for this task

So let's move onto jupyter notebook and import the algorithm, LabelEncoder, pandas library and the dataset In [1]:

import pandas
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder
dt = pandas.read_csv('advertisement.csv')
dt.head(3)

   Age       Search    Ad
0   18         Cars   Car
1   19  Automobiles  None
2   21  Automobiles  None


We again have very little data to work with. We have Age and Search as input and Ad as output. But we need to preprocess the labels In [2]:

Enc = LabelEncoder()
Enc.fit(['Cars', 'Automobiles', 'Health', 'Car', 'Insurance', 'None'])
for col in ['Search', 'Ad']:
    dt[col] = Enc.transform(dt[col])

So we encoded the values in the Search and Ad columns using the Enc encoder. Now we can move on to the next step of data splitting, but as mentioned earlier we don't have enough data to split it into training and testing sets. So we need to create the testing set ourselves. Let's take a look at the whole dataset:

    Age       Search         Ad
0    18         Cars        Car
1    19  Automobiles       None
2    21  Automobiles       None
3    22         Cars        Car
4    23         None       None
5    26         None       None
6    27       Health  Insurance
7    28         Cars        Car
8    18       Health  Insurance
9    19         None       None
10   20         None       None
11   22  Automobiles       None
12   26         None  Insurance
13   30       Health  Insurance
14   30  Automobiles        Car
15   29         Cars        Car
16   29       Health  Insurance
17   29         None  Insurance
18   32  Automobiles        Car
19   32       Health  Insurance

We have 20 rows and 3 columns worth of data. We already know the input, i.e. Age and Search, and the output, i.e. Ad, where the input Search will be provided by another classifier which will classify the user into the Cars, Automobiles, Health and None classes on the basis of previous searches. From the data we can visually form the understanding that a person of age 18-28 should be shown the Car advertisement only when the person is in the Cars class, and the Insurance advertisement when the person is in the Health class. Similarly, a person of age 30 or more should be shown the Car advertisement if the person is in the Cars or Automobiles class, and vice versa. And a person of age 18-24 in the None class should be shown nothing, but if the person is 25 or more the Insurance ad should be shown.
All these assumed conditions are called hypotheses. So using these hypotheses we can create a testing dataset with close to accurate outputs. So let's create it In [3]:

import numpy
X = dt.drop(columns='Ad')
y = dt['Ad']

def en(val):
    v = Enc.transform([val])
    return v[0]

tst_X = numpy.array(
    [[21, en('None')], [21, en('Health')],
     [27, en('None')], [23, en('Automobiles')],
     [34, en('None')], [34, en('Automobiles')]])
tst_y = numpy.array(
    [[en('None')], [en('Insurance')],
     [en('Insurance')], [en('None')],
     [en('Insurance')], [en('Car')]])

First of all we imported the numpy package to create our test data. Then we divided our dataset into training input and training output.
To create the testing set we use the en() function, which takes a Search or Ad label and returns the encoded value for it. Now we can create some testing data based on our hypotheses, like a 21-year-old person of class None ([21, en('None')]) who should be shown no advertisement ([en('None')]).
As mentioned earlier the testing set is built upon hypotheses. They may be correct or wrong. We created them for the purpose of testing our model. We can only do this in situations like this one, where the data is limited to a small dataset. We can visualize the testing set using a pandas data frame



In [4]:

def dec(val):
    v = Enc.inverse_transform(val)
    return v

tst = pandas.DataFrame({'Age': tst_X[:, 0],
                        'Search': dec(tst_X[:, 1]),
                        'Ad': dec(tst_y[:, 0])})
tst

   Age       Search         Ad
0   21         None       None
1   21       Health  Insurance
2   27         None  Insurance
3   23  Automobiles       None
4   34         None  Insurance
5   34  Automobiles        Car

We defined the dec() function to keep our code short. The ages in the testing set are not present in the actual dataset. Now we can move on to creating our KNN classifier and training it In [5]:

KNN = KNeighborsClassifier()
KNN.fit(X, y)

Out[5]:

KNeighborsClassifier()

Now we can pass the test input to the KNN classifier and print the accuracy In [6]:

from sklearn.metrics import accuracy_score
pred_y = KNN.predict(tst_X)
accuracy_score(tst_y, pred_y)

Out[6]:

0.8333333333333334

So our model has an accuracy of approximately 83%. Given that the testing set has 6 inputs, our model has predicted correctly for 5 of them and wrongly for only one
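The hypotheses behind the testing set can also be written down as explicit rules, which makes it easy to check that the six expected outputs really follow from them. The sketch below is an illustration under stated assumptions: the function names are made up, the age thresholds (30 for the Automobiles class, 25 for the None class) are one reading of the hypotheses, and the list of predictions with a single mistake is hypothetical, since the text does not say which input the model got wrong.

```python
def expected_ad(age, search):
    """Return the ad that the hypotheses expect for a user of this age and class."""
    if search == 'Cars':
        return 'Car'
    if search == 'Health':
        return 'Insurance'
    if search == 'Automobiles':
        return 'Car' if age >= 30 else 'None'
    return 'Insurance' if age >= 25 else 'None'    # the None class

def accuracy(actual, predicted):
    """Share of predictions that match the expected outputs."""
    hits = sum(1 for a, p in zip(actual, predicted) if a == p)
    return hits / len(actual)

# The six test inputs used above, as (age, class) pairs
test_inputs = [(21, 'None'), (21, 'Health'), (27, 'None'),
               (23, 'Automobiles'), (34, 'None'), (34, 'Automobiles')]
expected = [expected_ad(age, search) for age, search in test_inputs]
print(expected)

# Hypothetical predictions with one mistake, only to show 5 correct out of 6
predicted = expected[:-1] + ['None']
print(accuracy(expected, predicted))
```

Five matches out of six give 5/6, which is where the score of roughly 0.83 comes from.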




But remember that the testing set is based upon our hypotheses, so maybe what the model predicted is right.
Also, I don't know whether you have wondered about it until now, but the testing set was based upon the hypotheses we created: we analyzed the data, found connections between the features and labels and created the testing set, and the model thinks the same as us for 5 of the inputs. The hypotheses we built are the same patterns and relations used by the model to predict. We arrived at them just by having a thorough look at the data, which is most likely not wrong, but the model does everything in a few milliseconds. Imagine there were tens of thousands of rows like these! Could you have done the same there? I hope your understanding of 'machine' and 'learning' in machine learning is clearer now

ML APPLICATION 3

• Checking wine quality

Problem: In a wine factory, you are asked to rate the quality of a produced batch on a scale of 1 to 5, given its chemical properties as input, and then tell whether the batch is good or not. A good rating is more than half (2.5). You have the following sample dataset of some 1500 samples.

Input and Output Values

batch: https://defmycode.cf/wp-content/uploads/2020/12/wine_batch.csv
sample: https://defmycode.cf/wp-content/uploads/2020/12/wine

We can use machine learning models to solve the problem, but the question is which algorithm to choose. If you are thinking of using a classifier because we need to rate the wine, that would be wrong. The rating is made on a scale of 1 to 5, where it can be 3 or 3.5 or even 3.45, so for this problem we are going to use linear regression. So let's import our sample dataset along with the other necessities and preview it with the describe() function.

In [1]:

import pandas
import numpy
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

dt = pandas.read_csv('wine_sample.csv')
(dt.describe()).round(1)

Out[1]:

       fixed acidity  volatile acidity  citric acid  residual sugar  chlorides
count         1499.0            1499.0       1499.0          1499.0     1499.0
mean             8.4               0.5          0.3             2.5        0.1
std              1.7               0.2          0.2             1.4        0.0
min              4.6               0.1          0.0             0.9        0.0
25%              7.2               0.4          0.1             1.9        0.1
50%              8.0               0.5          0.3             2.2        0.1
75%              9.3               0.6          0.4             2.6        0.1
max             15.9               1.6          1.0            15.5        0.6

       free sulfur dioxide  total sulfur dioxide  density      pH  sulphates  alcohol  quality
count               1499.0                1499.0   1499.0  1499.0     1499.0   1499.0   1499.0
mean                  15.6                  46.8      1.0     3.3        0.7     10.4      5.6
std                   10.5                  33.3      0.0     0.2        0.2      1.1      0.8
min                    1.0                   6.0      1.0     2.7        0.3      8.4      3.0
25%                    7.0                  22.0      1.0     3.2        0.6      9.5      5.0
50%                   13.0                  38.0      1.0     3.3        0.6     10.1      6.0
75%                   21.0                  63.0      1.0     3.4        0.7     11.1      6.0
max                   72.0                 289.0      1.0     4.0        2.0     14.9      8.0

In this dataset we have twelve columns with about 1500 samples. The first eleven columns are different chemical properties of the wine, i.e. the input, and quality is the rating, i.e. the output. The minimum rating is 3 and the maximum is 8, but we need to rate the quality of the wine on a scale of 1 to 5. So we need to rescale the quality feature to the range 1 to 5, and we will do that using the MinMaxScaler.

In [2]:

from sklearn.preprocessing import MinMaxScaler

Sclr = MinMaxScaler(feature_range=(1, 5))
qal = numpy.array(dt['quality'])
dt['quality'] = Sclr.fit_transform(qal.reshape(-1,1))
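The rescaling itself is a simple linear mapping: each value is shifted by the column minimum, divided by the column range, then stretched to the target range. As a minimal sketch with numpy only, assuming the minimum of 3 and maximum of 8 reported by describe() (the four raw scores below are picked for illustration):

```python
import numpy

# Raw quality scores; 3 and 8 are the minimum and maximum reported by describe()
raw = numpy.array([3.0, 5.0, 6.0, 8.0])

lo, hi = 1.0, 5.0                      # target range, as in feature_range=(1, 5)
x_min, x_max = raw.min(), raw.max()

# Shift to zero, divide by the range, then stretch to [lo, hi]
scaled = (raw - x_min) / (x_max - x_min) * (hi - lo) + lo
print(scaled)
```

Under this mapping the lowest raw score becomes 1 and the highest becomes 5, while a raw score of 5 lands at 2.6, so the 2.5 cutoff for a good batch corresponds to a raw score of 4.875.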



We imported MinMaxScaler and created our Sclr object of that class. We passed the scale, i.e. 1 to 5, in the feature_range parameter. Then we created a numpy array of the quality feature and scaled the data using the fit_transform() function. Note that we passed the array reshaped with reshape(-1,1), which converts the 1-D array [5,6,7,...] to the 2-D array [[5],[6],[7],...]. Now we can split the data, create our linear regressor and train it.

In [3]:

X = dt.drop(columns='quality')
y = dt['quality']
trnX,tstX,trnY,tstY = train_test_split(X,y,test_size=0.1)

Reg = LinearRegression()
Reg.fit(trnX,trnY)

Out[3]:

LinearRegression()
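With test_size=0.1, train_test_split holds out roughly a tenth of the rows for testing. The split is shuffled randomly, and since the call above does not fix a seed, the rows that end up in the testing set (and therefore the metrics) can differ slightly from run to run. The sketch below uses a synthetic stand-in for the wine data (random numbers, not the real file) and adds random_state, which is not in the original call, to make the split repeatable:

```python
import numpy
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the wine data: 1499 rows, 11 input columns
rng = numpy.random.default_rng(0)
demo_X = rng.normal(size=(1499, 11))
demo_y = demo_X @ numpy.arange(1, 12) + rng.normal(scale=0.1, size=1499)

# random_state is an addition here; it makes the split repeatable
demo_trnX, demo_tstX, demo_trnY, demo_tstY = train_test_split(
    demo_X, demo_y, test_size=0.1, random_state=0)

# One coefficient is learned per input column
demo_reg = LinearRegression().fit(demo_trnX, demo_trnY)
print(len(demo_trnX), len(demo_tstX), demo_reg.coef_.shape)
```

For 1499 rows this leaves 1349 rows for training and 150 for testing.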

Before checking the quality of the given batch we need to test our model and find some metrics. So let's use the testing sets and compare the model's predictions.

In [4]:

predY = Reg.predict(tstX)
mae = metrics.mean_absolute_error(tstY,predY)
err = metrics.max_error(tstY,predY)
cmp = pandas.DataFrame({'Predicted':predY, 'Actual':tstY.values})
print('MAE: ',mae, '\n','Max RE: ',err)
cmp.plot(figsize=(7.5,6))

MAE: 0.43020099299063136
Max RE: 1.5037184643992099

Our model has an MAE (mean absolute error) of approximately 0.43, i.e. the average of the absolute errors, and a maximum residual error (max error) of 1.5. We have a lot of values, so let's plot a graph to compare them.



Out[4]:

<matplotlib.axes._subplots.AxesSubplot at 0x23b10ef1160>
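Both metrics are easy to reproduce by hand, which helps in reading them. The sketch below uses five made-up rating pairs (not values from our testing set): the MAE is the mean of the absolute differences, and the max error is the largest of them.

```python
import numpy

# Made-up actual and predicted ratings, for illustration only
actual = numpy.array([3.4, 2.6, 2.6, 4.2, 3.4])
predicted = numpy.array([3.0, 2.9, 2.5, 3.1, 3.6])

abs_err = numpy.abs(actual - predicted)   # size of each miss, sign ignored
toy_mae = abs_err.mean()                  # mean absolute error
toy_max = abs_err.max()                   # worst single miss
print(round(toy_mae, 2), round(toy_max, 2))
```

A single badly missed sample dominates the max error while barely moving the MAE, which is why the two numbers are reported together.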

Looking at the data, we can tell how our model is performing. By observing the graph we can tell that our model didn't give a rating of 5 to any input, whereas the actual values contain a 5 only 3 times, which explains it: high ratings are rare in the data, so the model rarely predicts them. Still, our model is fine, so let's import the batch dataset and pass it to the model.

In [5]:

batch = pandas.read_csv('wine_batch.csv')
batch_pred = Reg.predict(batch.values)
batch_pred.mean()

Out[5]:

3.135502047867703

We imported the CSV file and passed its values to the model for prediction. On average the rating is 3.14, and if we take the MAE (0.43) into account the rating could also be 2.71 or 3.57. In all of these cases the average rating of the batch is higher than 2.5, so the batch is fine!
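The decision can be written out as a short check. This sketch plugs in the two numbers printed above and treats the MAE as a rough band around the batch average; that is a simple rule of thumb rather than a formal confidence interval, since the MAE describes the typical error of a single prediction.

```python
batch_mean = 3.135502047867703   # Out[5] above
mae = 0.43020099299063136        # MAE printed by In [4]
threshold = 2.5                  # "good" cutoff from the problem statement

low, high = batch_mean - mae, batch_mean + mae
is_good = low > threshold        # even the pessimistic bound must clear the cutoff
print(round(low, 2), round(high, 2), is_good)
```

Requiring the lower bound to clear the cutoff is the stricter of the possible checks; comparing only batch_mean with the threshold would accept the batch as well.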

ML APPLICATION 4

• Match play decision