Python: Real-World Data Science 1786465167, 9781786465160

Unleash the power of Python and its robust data science capabilities About This Book * Unleash the power of Python 3 obj

430 96 23MB

English Pages 1255 Year 2016

Report DMCA / Copyright

DOWNLOAD PDF FILE

Table of contents :
Cover
Meet Your Course Guide
Table of Contents
Introduction and First Steps – Take a Deep Breath
A proper introduction
Enter the Python
About Python
Portability
Coherence
Developer productivity
An extensive library
Software quality
Software integration
Satisfaction and enjoyment
What are the drawbacks?
Who is using Python today?
Setting up the environment
Python 2 versus Python 3 – the great debate
What you need for this course
Installing Python
Installing IPython
Installing additional packages
How you can run a Python program
Running Python scripts
Running the Python interactive shell
Running Python as a service
Running Python as a GUI application
How is Python code organized
How do we use modules and packages
Python's execution model
Names and namespaces
Scopes
Guidelines on how to write good code
The Python culture
A note on the IDEs
Object-oriented Design
Introducing object-oriented
Objects and classes
Specifying attributes and behaviors
Data describes objects
Behaviors are actions
Hiding details and creating the public interface
Composition
Inheritance
Inheritance provides abstraction
Multiple inheritance
Case study
Objects in Python
Creating Python classes
Adding attributes
Making it do something
Talking to yourself
More arguments
Initializing the object
Explaining yourself
Modules and packages
Organizing the modules
Absolute imports
Relative imports
Organizing module contents
Who can access my data?
Third-party libraries
Case study
When Objects Are Alike
Basic inheritance
Extending built-ins
Overriding and super
Multiple inheritance
The diamond problem
Different sets of arguments
Polymorphism
Abstract base classes
Using an abstract base class
Creating an abstract base class
Demystifying the magic
Case study
Expecting the Unexpected
Raising exceptions
Raising an exception
The effects of an exception
Handling exceptions
The exception hierarchy
Defining our own exceptions
Case study
When to Use Object-oriented Programming
Treat objects as objects
Adding behavior to class data with properties
Properties in detail
Decorators – another way to create properties
Deciding when to use properties
Manager objects
Removing duplicate code
In practice
Case study
Python Data Structures
Empty objects
Tuples and named tuples
Named tuples
Dictionaries
Dictionary use cases
Using defaultdict
Counter
Lists
Sorting lists
Sets
Extending built-ins
Queues
FIFO queues
LIFO queues
Priority queues
Case study
Python Object-oriented Shortcuts
Python built-in functions
The len() function
Reversed
Enumerate
File I/O
Placing it in context
An alternative to method overloading
Default arguments
Variable argument lists
Unpacking arguments
Functions are objects too
Using functions as attributes
Callable objects
Case study
Strings and Serialization
Strings
String manipulation
String formatting
Escaping braces
Keyword arguments
Container lookups
Object lookups
Making it look right
Strings are Unicode
Converting bytes to text
Converting text to bytes
Mutable byte strings
Regular expressions
Matching patterns
Matching a selection of characters
Escaping characters
Matching multiple characters
Grouping patterns together
Getting information from regular expressions
Making repeated regular expressions efficient
Serializing objects
Customizing pickles
Serializing web objects
Case study
The Iterator Pattern
Design patterns in brief
Iterators
The iterator protocol
Comprehensions
List comprehensions
Set and dictionary comprehensions
Generator expressions
Generators
Yield items from another iterable
Coroutines
Back to log parsing
Closing coroutines and throwing exceptions
The relationship between coroutines, generators, and functions
Case study
Python Design Patterns I
The decorator pattern
A decorator example
Decorators in Python
The observer pattern
An observer example
The strategy pattern
A strategy example
Strategy in Python
The state pattern
A state example
State versus strategy
State transition as coroutines
The singleton pattern
Singleton implementation
The template pattern
A template example
Python Design Patterns II
The adapter pattern
The facade pattern
The flyweight pattern
The command pattern
The abstract factory pattern
The composite pattern
Testing Object-oriented Programs
Why test?
Test-driven development
Unit testing
Assertion methods
Reducing boilerplate and cleaning up
Organizing and running tests
Ignoring broken tests
Testing with py.test
One way to do setup and cleanup
A completely different way to set up variables
Skipping tests with py.test
Imitating expensive objects
How much testing is enough?
Case study
Implementing it
Concurrency
Threads
The many problems with threads
Shared memory
The global interpreter lock
Thread overhead
Multiprocessing
Multiprocessing pools
Queues
The problems with multiprocessing
Futures
AsyncIO
AsyncIO in action
Reading an AsyncIO future
AsyncIO for networking
Using executors to wrap blocking code
Streams
Executors
Case study
Introducing Data Analysis and Libraries
Data analysis and processing
An overview of the libraries in data analysis
Python libraries in data analysis
NumPy
pandas
Matplotlib
PyMongo
The scikit-learn library
NumPy Arrays and Vectorized Computation
NumPy arrays
Data types
Array creation
Indexing and slicing
Fancy indexing
Numerical operations on arrays
Array functions
Data processing using arrays
Loading and saving data
Saving an array
Loading an array
Linear algebra with NumPy
NumPy random numbers
Data Analysis with pandas
An overview of the pandas package
The pandas data structure
Series
The DataFrame
The essential basic functionality
Reindexing and altering labels
Head and tail
Binary operations
Functional statistics
Function application
Sorting
Indexing and selecting data
Computational tools
Working with missing data
Advanced uses of pandas for data analysis
Hierarchical indexing
The Panel data
Data Visualization
The matplotlib API primer
Line properties
Figures and subplots
Exploring plot types
Scatter plots
Bar plots
Contour plots
Histogram plots
Legends and annotations
Plotting functions with pandas
Additional Python data visualization tools
Bokeh
MayaVi
Time Series
Time series primer
Working with date and time objects
Resampling time series
Downsampling time series data
Upsampling time series data
Timedeltas
Time series plotting
Interacting with Databases
Interacting with data in text format
Reading data from text format
Writing data to text format
Interacting with data in binary format
HDF5
Interacting with data in MongoDB
Interacting with data in Redis
The simple value
List
Set
Ordered set
Data Analysis Application Examples
Data munging
Cleaning data
Filtering
Merging data
Reshaping data
Data aggregation
Grouping data
Getting Started with Data Mining
Introducing data mining
A simple affinity analysis example
What is affinity analysis?
Product recommendations
Loading the dataset with NumPy
Implementing a simple ranking of rules
Ranking to find the best rules
A simple classification example
What is classification?
Loading and preparing the dataset
Implementing the OneR algorithm
Testing the algorithm
Classifying with scikit-learn Estimators
scikit-learn estimators
Nearest neighbors
Distance metrics
Loading the dataset
Moving towards a standard workflow
Running the algorithm
Setting parameters
Preprocessing using pipelines
An example
Standard preprocessing
Putting it all together
Pipelines
Predicting Sports Winners with Decision Trees
Loading the dataset
Collecting the data
Using pandas to load the dataset
Cleaning up the dataset
Extracting new features
Decision trees
Parameters in decision trees
Using decision trees
Sports outcome prediction
Putting it all together
Random forests
How do ensembles work?
Parameters in Random forests
Applying Random forests
Engineering new features
Recommending Movies Using Affinity Analysis
Affinity analysis
Algorithms for affinity analysis
Choosing parameters
The movie recommendation problem
Obtaining the dataset
Loading with pandas
Sparse data formats
The Apriori implementation
The Apriori algorithm
Implementation
Extracting association rules
Evaluation
Extracting Features with Transformers
Feature extraction
Representing reality in models
Common feature patterns
Creating good features
Feature selection
Selecting the best individual features
Feature creation
Creating your own transformer
The transformer API
Implementation details
Unit testing
Putting it all together
Social Media Insight Using Naive Bayes
Disambiguation
Downloading data from a social network
Loading and classifying the dataset
Creating a replicable dataset from Twitter
Text transformers
Bag-of-words
N-grams
Other features
Naive Bayes
Bayes' theorem
Naive Bayes algorithm
How it works
Application
Extracting word counts
Converting dictionaries to a matrix
Training the Naive Bayes classifier
Putting it all together
Evaluation using the F1-score
Getting useful features from models
Discovering Accounts to Follow Using Graph Mining
Loading the dataset
Classifying with an existing model
Getting follower information from Twitter
Building the network
Creating a graph
Creating a similarity graph
Finding subgraphs
Connected components
Optimizing criteria
Beating CAPTCHAs with Neural Networks
Artificial neural networks
An introduction to neural networks
Creating the dataset
Drawing basic CAPTCHAs
Splitting the image into individual letters
Creating a training dataset
Adjusting our training dataset to our methodology
Training and classifying
Back propagation
Predicting words
Improving accuracy using a dictionary
Ranking mechanisms for words
Putting it all together
Authorship Attribution
Attributing documents to authors
Applications and use cases
Attributing authorship
Getting the data
Function words
Counting function words
Classifying with function words
Support vector machines
Classifying with SVMs
Kernels
Character n-grams
Extracting character n-grams
Using the Enron dataset
Accessing the Enron dataset
Creating a dataset loader
Putting it all together
Evaluation
Clustering News Articles
Obtaining news articles
Using a Web API to get data
Reddit as a data source
Getting the data
Extracting text from arbitrary websites
Finding the stories in arbitrary websites
Putting it all together
Grouping news articles
The k-means algorithm
Evaluating the results
Extracting topic information from clusters
Using clustering algorithms as transformers
Clustering ensembles
Evidence accumulation
How it works
Implementation
Online learning
An introduction to online learning
Implementation
Classifying Objects in Images Using Deep Learning
Object classification
Application scenario and goals
Use cases
Deep neural networks
Intuition
Implementation
An introduction to Theano
An introduction to Lasagne
Implementing neural networks with nolearn
GPU optimization
When to use GPUs for computation
Running our code on a GPU
Setting up the environment
Application
Getting the data
Creating the neural network
Putting it all together
Working with Big Data
Big data
Application scenario and goals
MapReduce
Intuition
A word count example
Hadoop MapReduce
Application
Getting the data
Naive Bayes prediction
The mrjob package
Extracting the blog posts
Training Naive Bayes
Putting it all together
Training on Amazon's EMR infrastructure
Next Steps…
Chapter 1 – Getting Started with Data Mining
Scikit-learn tutorials
Extending the IPython Notebook
Chapter 2 – Classifying with scikit-learn Estimators
More complex pipelines
Comparing classifiers
Chapter 3: Predicting Sports Winners with Decision Trees
More on pandas
Chapter 4 – Recommending Movies Using Affinity Analysis
The Eclat algorithm
Chapter 5 – Extracting Features with Transformers
Vowpal Wabbit
Chapter 6 – Social Media Insight Using Naive Bayes
Natural language processing and part-of-speech tagging
Chapter 7 – Discovering Accounts to Follow Using Graph Mining
More complex algorithms
Chapter 8 – Beating CAPTCHAs with Neural Networks
Deeper networks
Reinforcement learning
Chapter 9 – Authorship Attribution
Local n-grams
Chapter 10 – Clustering News Articles
Real-time clusterings
Chapter 11 – Classifying Objects in Images Using Deep Learning
Keras and Pylearn2
Mahotas
Chapter 12 – Working with Big Data
Courses on Hadoop
Pydoop
Recommendation engine
More resources
Giving Computers the Ability to Learn from Data
How to transform data into knowledge
The three different types of machine learning
Making predictions about the future with supervised learning
Classification for predicting class labels
Regression for predicting continuous outcomes
Solving interactive problems with reinforcement learning
Discovering hidden structures with unsupervised learning
Finding subgroups with clustering
Dimensionality reduction for data compression
An introduction to the basic terminology and notations
A roadmap for building machine learning systems
Preprocessing – getting data into shape
Training and selecting a predictive model
Evaluating models and predicting unseen data instances
Using Python for machine learning
Training Machine Learning Algorithms for Classification
Artificial neurons – a brief glimpse into the early history of machine learning
Implementing a perceptron learning algorithm in Python
Training a perceptron model on the Iris dataset
Adaptive linear neurons and the convergence of learning
Minimizing cost functions with gradient descent
Implementing an Adaptive Linear Neuron in Python
Large scale machine learning and stochastic gradient descent
A Tour of Machine Learning Classifiers Using scikit-learn
Choosing a classification algorithm
First steps with scikit-learn
Training a perceptron via scikit-learn
Modeling class probabilities via logistic regression
Logistic regression intuition and conditional probabilities
Learning the weights of the logistic cost function
Training a logistic regression model with scikit-learn
Tackling overfitting via regularization
Maximum margin classification with support vector machines
Maximum margin intuition
Dealing with the nonlinearly separable case using slack variables
Alternative implementations in scikit-learn
Solving nonlinear problems using a kernel SVM
Using the kernel trick to find separating hyperplanes in higher dimensional space
Decision tree learning
Maximizing information gain – getting the most bang for the buck
Building a decision tree
Combining weak to strong learners via random forests
K-nearest neighbors – a lazy learning algorithm
Building Good Training Sets – Data Preprocessing
Dealing with missing data
Eliminating samples or features with missing values
Imputing missing values
Understanding the scikit-learn estimator API
Handling categorical data
Mapping ordinal features
Encoding class labels
Performing one-hot encoding on nominal features
Partitioning a dataset in training and test sets
Bringing features onto the same scale
Selecting meaningful features
Sparse solutions with L1 regularization
Sequential feature selection algorithms
Assessing feature importance with random forests
Compressing Data via Dimensionality Reduction
Unsupervised dimensionality reduction via principal component analysis
Total and explained variance
Feature transformation
Principal component analysis in scikit-learn
Supervised data compression via linear discriminant analysis
Computing the scatter matrices
Selecting linear discriminants for the new feature subspace
Projecting samples onto the new feature space
LDA via scikit-learn
Using kernel principal component analysis for nonlinear mappings
Kernel functions and the kernel trick
Implementing a kernel principal component analysis in Python
Example 1 – separating half-moon shapes
Example 2 – separating concentric circles
Projecting new data points
Kernel principal component analysis in scikit-learn
Learning Best Practices for Model Evaluation and Hyperparameter Tuning
Streamlining workflows with pipelines
Loading the Breast Cancer Wisconsin dataset
Combining transformers and estimators in a pipeline
Using k-fold cross-validation to assess model performance
The holdout method
K-fold cross-validation
Debugging algorithms with learning and validation curves
Diagnosing bias and variance problems with learning curves
Addressing overfitting and underfitting with validation curves
Fine-tuning machine learning models via grid search
Tuning hyperparameters via grid search
Algorithm selection with nested cross-validation
Looking at different performance evaluation metrics
Reading a confusion matrix
Optimizing the precision and recall of a classification model
Plotting a receiver operating characteristic
The scoring metrics for multiclass classification
Combining Different Models for Ensemble Learning
Learning with ensembles
Implementing a simple majority vote classifier
Combining different algorithms for classification with majority vote
Evaluating and tuning the ensemble classifier
Bagging – building an ensemble of classifiers from bootstrap samples
Leveraging weak learners via adaptive boosting
Predicting Continuous Target Variables with Regression Analysis
Introducing a simple linear regression model
Exploring the Housing Dataset
Visualizing the important characteristics of a dataset
Implementing an ordinary least squares linear regression model
Solving regression for regression parameters with gradient descent
Estimating the coefficient of a regression model via scikit-learn
Fitting a robust regression model using RANSAC
Evaluating the performance of linear regression models
Using regularized methods for regression
Turning a linear regression model into a curve – polynomial regression
Modeling nonlinear relationships in the Housing Dataset
Dealing with nonlinear relationships using random forests
Decision tree regression
Random forest regression
Reflect and Test Yourself! Answers
Module 2: Data Analysis
Chapter 1: Introducing Data Analysis and Libraries
Chapter 2: Object-oriented Design
Chapter 3: Data Analysis with pandas
Chapter 4: Data Visualization
Chapter 5: Time Series
Chapter 6: Interacting with Databases
Chapter 7: Data Analysis Application Examples
Module 3: Data Mining
Chapter 1: Getting Started with Data Mining
Chapter 2: Classifying with scikit-learn Estimators
Chapter 3: Predicting Sports Winners with Decision Trees
Chapter 4: Recommending Movies Using Affinity Analysis
Chapter 5: Extracting Features with Transformers
Chapter 6: Social Media Insight Using Naive Bayes
Chapter 7: Discovering Accounts to Follow Using Graph Mining
Chapter 8: Beating CAPTCHAs with Neural Networks
Chapter 9: Authorship Attribution
Chapter 10: Clustering News Articles
Chapter 11: Classifying Objects in Images Using Deep Learning
Chapter 12: Working with Big Data
Module 4: Machine Learning
Chapter 1: Giving Computers the Ability to Learn from Data
Chapter 2: Training Machine Learning
Chapter 3: A Tour of Machine Learning Classifiers Using scikit-learn
Chapter 4: Building Good Training Sets – Data Preprocessing
Chapter 5: Compressing Data via Dimensionality Reduction
Chapter 6: Learning Best Practices for Model Evaluation and Hyperparameter Tuning
Chapter 7: Combining Different Models for Ensemble Learning
Chapter 8: Predicting Continuous Target Variables with Regression Analysis
Recommend Papers

Python: Real-World Data Science
 1786465167, 9781786465160

  • 0 0 0
  • Like this paper and download? You can publish your own PDF file online for free in a few minutes! Sign Up
File loading please wait...
Citation preview

!

R

20 16

EW

N

FO

Python Real-World Data Science

CURATED COURSE

Python Real-World Data Science A course in four modules

Unleash the power of Python and its robust data science capabilities with your Course Guide Ankita Thakur

Learn to use powerful Python libraries for effective data processing and analysis

To contact your Course Guide Email: [email protected]

BIRMINGHAM - MUMBAI

Meet Your Course Guide

Hello and welcome to this Data Science with Python course. You now have a clear pathway from learning Python core features right through to getting acquainted with the concepts and techniques of the data science field—all using Python! This course has been planned and created for you by me Ankita Thakur – I am your Course Guide, and I am here to help you have a great journey along the pathways of learning that I have planned for you. I've developed and created this course for you and you'll be seeing me through the whole journey, offering you my thoughts and ideas behind what you're going to learn next and why I recommend each step. I'll provide tests and quizzes to help you reflect on your learning, and code challenges that will be pitched just right for you through the course. If you have any questions along the way, you can reach out to me over e-mail or telephone and I'll make sure you get everything from the course that we've planned – for you to start your career in the field of data science. Details of how to contact me are included on the first page of this course.

What's so cool about Data Science?

What is Data Science and why is there so much of buzz about this in the world? Is it of great importance? Well, the following sentence will answer all such questions: "This hot new field promises to revolutionize industries from business to government, health care to academia." – The New York Times The world is generating data at an increasing pace. Consumers, sensors, or scientific experiments emit data points every day. In finance, business, administration, and the natural or social sciences, working with data can make up a significant part of the job. Being able to efficiently work with small or large datasets has become a valuable skill. Also, we live in a world of connected things where tons of data is generated and it is humanly impossible to analyze all the incoming data and make decisions. Human decisions are increasingly replaced by decisions made by computers. Thanks to the field of Data Science!

Data science has penetrated deeply in our connected world and there is a growing demand in the market for people who not only understand data science algorithms thoroughly, but are also capable of programming these algorithms. A field that is at the intersection of many fields, including data mining, machine learning, and statistics, to name a few. This puts an immense burden on all levels of data scientists; from the one who is aspiring to become a data scientist and those who are currently practitioners in this field. Treating these algorithms as a black box and using them in decision-making systems will lead to counterproductive results. With tons of algorithms and innumerable problems out there, it requires a good grasp of the underlying algorithms in order to choose the best one for any given problem. Python as a programming language has evolved over the years and today, it is the number one choice for a data scientist. Python has become the most popular programming language for data science because it allows us to forget about the tedious parts of programming and offers us an environment where we can quickly jot down our ideas and put concepts directly into action. It has been used in industry for a long time, but it has been popular among researchers as well. In contrast to more specialized applications and environments, Python is not only about data analysis. The list of industrial-strength libraries for many general computing tasks is long, which makes working with data in Python even more compelling. Whether your data lives inside SQL or NoSQL databases or is out there on the Web and must be crawled or scraped first, the Python community has already developed packages for many of those tasks.

Course Structure

Frankly speaking, it's a wise decision to know the nitty-gritty of Python as it's a trending language. I'm sure you'll gain lot of knowledge through this course and be able to implement all those in practice. However, I want to highlight that the road ahead may be bumpy on occasions, and some topics may be more challenging than others, but I hope that you will embrace this opportunity and focus on the reward. Remember that we are on this journey together, and throughout this course, we will add many powerful techniques to your arsenal that will help us solve even the toughest problems the data-driven way.

I've created this learning path for you that consist of four models. Each of these modules are a mini-course in their own way, and as you complete each one, you'll have gained key skills and be ready for the material in the next module.

So let's now look at the pathway these modules create—basically all the topics that will be exploring in this learning journey.

Course Journey

We start the course with our very first module, Python Fundamentals, to help you get familiar with Python. Installing Python correctly is equal to half job done. This module starts with the installation of Python, IPython, and all the necessary packages. Then, we'll see the fundamentals of object-oriented programming because Python itself is an object-oriented programming language. Finally, we'll make friends with some of the core concepts of Python—how to get Python programming basics nailed down.

Then we'll move towards the analysis part. The second module, Data Analysis, will get you started with Python data analysis in a practical and example-driven way. You'll see how we can use Python libraries for effective data processing and analysis. So, if you want to to get started with basic data processing tasks or time series, then you can find lot of hands-on knowledge in the examples of this module.

The third module, Data Mining, is designed in a way that you have a good understanding of the basics, some best practices to jump into solving problems with data mining, and some pointers on the next steps you can take. Now, you can harness the power of Python to analyze data and create insightful predictive models.

Finally, we'll move towards exploring more advanced topics. Sometimes an analysis task is too complex to program by hand. Machine learning is a modern technique that enables computers to discover patterns and draw conclusions for themselves. The aim of our fourth module, Machine Learning, is to provide you with a module where we'll discuss the necessary details regarding machine learning concepts, offering intuitive yet informative explanations on how machine learning algorithms work, how to use them, and most importantly, how to avoid the common pitfalls. So, if you want to become a machine-learning practitioner, a better problem solver, or maybe even consider a career in machine learning research, I'm sure there is lot for you in this module!

The Course Roadmap and Timeline

Here's a view of the entire course plan before we begin. This grid gives you a topic overview of the whole course and its modules, so you can see how we will move through particular phases of learning to use Python, what skills you'll be learning along the way, and what you can do with those skills at each point. I also offer you an estimate of the time you might want to take for each module, although a lot depends on your learning style how much you're able to give the course each week!

Table of Contents Course Module 1: Python Fundamentals Chapter 1: Introduction and First Steps – Take a Deep Breath

5

A proper introduction 6 Enter the Python 8 About Python 9 Portability 9 Coherence 9 Developer productivity 9 An extensive library 10 Software quality 10 Software integration 10 Satisfaction and enjoyment 10 What are the drawbacks? 11 Who is using Python today? 11 Setting up the environment 11 Python 2 versus Python 3 – the great debate 12 What you need for this course 13 Installing Python 14 Installing IPython 14 Installing additional packages 16 How you can run a Python program 17 Running Python scripts 18 Running the Python interactive shell 18 Running Python as a service 20 Running Python as a GUI application 20 How is Python code organized 21 How do we use modules and packages 22 [i]

Table of Contents

Python's execution model 25 Names and namespaces 25 Scopes 27 Guidelines on how to write good code 30 The Python culture 31 A note on the IDEs 33

Chapter 2: Object-oriented Design

35

Chapter 3: Objects in Python

63

Introducing object-oriented 35 Objects and classes 37 Specifying attributes and behaviors 39 Data describes objects 40 Behaviors are actions 41 Hiding details and creating the public interface 43 Composition 45 Inheritance 48 Inheritance provides abstraction 50 Multiple inheritance 51 Case study 52 Creating Python classes Adding attributes Making it do something

63 65 66

Talking to yourself More arguments

66 67

Initializing the object Explaining yourself Modules and packages Organizing the modules

69 71 73 76

Organizing module contents Who can access my data? Third-party libraries Case study

79 82 84 85

Absolute imports Relative imports

76 77

Chapter 4: When Objects Are Alike Basic inheritance Extending built-ins Overriding and super

[ ii ]

97

97 100 101

Table of Contents

Multiple inheritance 103 The diamond problem 105 Different sets of arguments 110 Polymorphism 113 Abstract base classes 116 Using an abstract base class 116 Creating an abstract base class 117 Demystifying the magic 119 Case study 120

Chapter 5: Expecting the Unexpected

137

Chapter 6: When to Use Object-oriented Programming

167

Chapter 7: Python Data Structures

199

Raising exceptions Raising an exception The effects of an exception Handling exceptions The exception hierarchy Defining our own exceptions Case study

Treat objects as objects Adding behavior to class data with properties Properties in detail Decorators – another way to create properties Deciding when to use properties Manager objects Removing duplicate code In practice Case study

138 139 141 142 148 149 154 167 171 174 176 178 180 182 184 187

Empty objects 199 Tuples and named tuples 201 Named tuples 203 Dictionaries 204 Dictionary use cases 208 Using defaultdict 208 Counter 210

Lists 211 Sorting lists 213 Sets 217 Extending built-ins 221 [ iii ]

Table of Contents

Queues 226 FIFO queues 227 LIFO queues 229 Priority queues 230 Case study 232

Chapter 8: Python Object-oriented Shortcuts

241

Chapter 9: Strings and Serialization

273

Python built-in functions 241 The len() function 242 Reversed 242 Enumerate 244 File I/O 245 Placing it in context 247 An alternative to method overloading 249 Default arguments 250 Variable argument lists 252 Unpacking arguments 256 Functions are objects too 257 Using functions as attributes 261 Callable objects 262 Case study 263 Strings 273 String manipulation 274 String formatting 276 Escaping braces Keyword arguments Container lookups Object lookups Making it look right

277 278 279 280 281

Strings are Unicode

283

Mutable byte strings Regular expressions Matching patterns

287 288 289

Converting bytes to text Converting text to bytes

284 285

Matching a selection of characters Escaping characters Matching multiple characters Grouping patterns together

Getting information from regular expressions Making repeated regular expressions efficient

[ iv ]

290 291 292 293

294

296

Table of Contents

Serializing objects Customizing pickles Serializing web objects Case study

296 298 301 304

Chapter 10: The Iterator Pattern

313

Chapter 11: Python Design Patterns I

345

Chapter 12: Python Design Patterns II

375

Design patterns in brief 313 Iterators 314 The iterator protocol 315 Comprehensions 317 List comprehensions 317 Set and dictionary comprehensions 320 Generator expressions 321 Generators 323 Yield items from another iterable 325 Coroutines 328 Back to log parsing 331 Closing coroutines and throwing exceptions 333 The relationship between coroutines, generators, and functions 334 Case study 335 The decorator pattern A decorator example Decorators in Python The observer pattern An observer example The strategy pattern A strategy example Strategy in Python The state pattern A state example State versus strategy State transition as coroutines The singleton pattern Singleton implementation The template pattern A template example The adapter pattern The facade pattern

[v]

345 346 349 351 352 354 355 357 357 358 364 364 364 365 369 369 375 379

Table of Contents

The flyweight pattern The command pattern The abstract factory pattern The composite pattern

381 385 390 395

Chapter 13: Testing Object-oriented Programs

403

Chapter 14: Concurrency

441

Why test? Test-driven development Unit testing Assertion methods Reducing boilerplate and cleaning up Organizing and running tests Ignoring broken tests Testing with py.test One way to do setup and cleanup A completely different way to set up variables Skipping tests with py.test Imitating expensive objects How much testing is enough? Case study Implementing it

403 405 406 408 410 411 412 414 416 419 423 424 428 431 432

Threads 442 The many problems with threads 445 Shared memory The global interpreter lock

445 446

Thread overhead 447 Multiprocessing 447 Multiprocessing pools 449 Queues 452 The problems with multiprocessing 454 Futures 454 AsyncIO 457 AsyncIO in action 458 Reading an AsyncIO future 459 AsyncIO for networking 460 Using executors to wrap blocking code 463 Streams 465 Executors 465

Case study

466

[ vi ]

Table of Contents

Course Module 2: Data Analysis Chapter 1: Introducing Data Analysis and Libraries

479

Chapter 2: NumPy Arrays and Vectorized Computation

491

Chapter 3: Data Analysis with pandas

511

Data analysis and processing 480 An overview of the libraries in data analysis 483 Python libraries in data analysis 486 NumPy 486 pandas 487 Matplotlib 487 PyMongo 487 The scikit-learn library 488 NumPy arrays Data types Array creation Indexing and slicing Fancy indexing Numerical operations on arrays Array functions Data processing using arrays Loading and saving data Saving an array Loading an array Linear algebra with NumPy NumPy random numbers

492 492 494 496 497 498 499 501 503 503 504 504 506

An overview of the pandas package 511 The pandas data structure 512 Series 512 The DataFrame 514 The essential basic functionality 518 Reindexing and altering labels 518 Head and tail 519 Binary operations 520 Functional statistics 521 Function application 523 Sorting 524 Indexing and selecting data 526 Computational tools 528 Working with missing data 529 [ vii ]

Table of Contents

Advanced uses of pandas for data analysis Hierarchical indexing The Panel data

532 532 535

Chapter 4: Data Visualization

539

Chapter 5: Time Series

563

Chapter 6: Interacting with Databases

587

Chapter 7: Data Analysis Application Examples

607

The matplotlib API primer 540 Line properties 543 Figures and subplots 545 Exploring plot types 548 Scatter plots 548 Bar plots 549 Contour plots 550 Histogram plots 551 Legends and annotations 552 Plotting functions with pandas 556 Additional Python data visualization tools 559 Bokeh 559 MayaVi 560 Time series primer 563 Working with date and time objects 564 Resampling time series 572 Downsampling time series data 573 Upsampling time series data 575 Timedeltas 579 Time series plotting 580

Interacting with data in text format 587 Reading data from text format 587 Writing data to text format 592 Interacting with data in binary format 593 HDF5 594 Interacting with data in MongoDB 595 Interacting with data in Redis 600 The simple value 600 List 601 Set 602 Ordered set 603 Data munging Cleaning data

[ viii ]

608 610

Table of Contents

Filtering 613 Merging data 616 Reshaping data 620 Data aggregation 621 Grouping data 624

Course Module 3: Data Mining Chapter 1: Getting Started with Data Mining

633

Chapter 2: Classifying with scikit-learn Estimators

653

Chapter 3: Predicting Sports Winners with Decision Trees

671

Introducing data mining A simple affinity analysis example What is affinity analysis? Product recommendations Loading the dataset with NumPy Implementing a simple ranking of rules Ranking to find the best rules A simple classification example What is classification? Loading and preparing the dataset Implementing the OneR algorithm Testing the algorithm

634 635 635 636 636 638 641 644 644 645 646 649

scikit-learn estimators 653 Nearest neighbors 654 Distance metrics 655 Loading the dataset 657 Moving towards a standard workflow 659 Running the algorithm 660 Setting parameters 661 Preprocessing using pipelines 664 An example 665 Standard preprocessing 665 Putting it all together 666 Pipelines 666 Loading the dataset Collecting the data Using pandas to load the dataset

[ ix ]

671 672 673

Table of Contents

Cleaning up the dataset Extracting new features Decision trees Parameters in decision trees Using decision trees Sports outcome prediction Putting it all together Random forests How do ensembles work? Parameters in Random forests Applying Random forests Engineering new features

674 675 677 678 679 680 680 684 685 686 686 688

Chapter 4: Recommending Movies Using Affinity Analysis

691

Chapter 5: Extracting Features with Transformers

711

Affinity analysis 691 Algorithms for affinity analysis 692 Choosing parameters 693 The movie recommendation problem 694 Obtaining the dataset 694 Loading with pandas 694 Sparse data formats 695 The Apriori implementation 696 The Apriori algorithm 698 Implementation 699 Extracting association rules 702 Evaluation 706 Feature extraction Representing reality in models Common feature patterns Creating good features Feature selection Selecting the best individual features Feature creation Creating your own transformer The transformer API Implementation details Unit testing Putting it all together

[x]

712 712 714 717 718 720 724 726 727 728 729 731

Table of Contents

Chapter 6: Social Media Insight Using Naive Bayes

733

Disambiguation 734 Downloading data from a social network 735 Loading and classifying the dataset 737 Creating a replicable dataset from Twitter 742 Text transformers 746 Bag-of-words 747 N-grams 748 Other features 749 Naive Bayes 749 Bayes' theorem 750 Naive Bayes algorithm 751 How it works 752 Application 753 Extracting word counts 754 Converting dictionaries to a matrix 755 Training the Naive Bayes classifier 755 Putting it all together 756 Evaluation using the F1-score 757 Getting useful features from models 758

Chapter 7: Discovering Accounts to Follow Using Graph Mining 763 Loading the dataset Classifying with an existing model Getting follower information from Twitter Building the network Creating a graph Creating a similarity graph Finding subgraphs Connected components Optimizing criteria

Chapter 8: Beating CAPTCHAs with Neural Networks Artificial neural networks An introduction to neural networks Creating the dataset Drawing basic CAPTCHAs Splitting the image into individual letters Creating a training dataset Adjusting our training dataset to our methodology

[ xi ]

764 765 767 770 773 775 779 780 783

791 792 793 795 796 797 799 801

Table of Contents

Training and classifying Back propagation Predicting words Improving accuracy using a dictionary Ranking mechanisms for words Putting it all together

802 804 805 810 810 811

Chapter 9: Authorship Attribution

815

Chapter 10: Clustering News Articles

841

Attributing documents to authors 816 Applications and use cases 816 Attributing authorship 817 Getting the data 819 Function words 822 Counting function words 823 Classifying with function words 824 Support vector machines 825 Classifying with SVMs 827 Kernels 827 Character n-grams 828 Extracting character n-grams 829 Using the Enron dataset 830 Accessing the Enron dataset 831 Creating a dataset loader 831 Putting it all together 836 Evaluation 837 Obtaining news articles Using a Web API to get data Reddit as a data source Getting the data Extracting text from arbitrary websites Finding the stories in arbitrary websites Putting it all together Grouping news articles The k-means algorithm Evaluating the results Extracting topic information from clusters Using clustering algorithms as transformers Clustering ensembles Evidence accumulation How it works [ xii ]

841 842 845 846 848 849 850 852 853 856 858 859 860 860 864

Table of Contents

Implementation 865 Online learning 866 An introduction to online learning 866 Implementation 867

Chapter 11: Classifying Objects in Images Using Deep Learning 873 Object classification 874 Application scenario and goals 874 Use cases 877 Deep neural networks 878 Intuition 879 Implementation 879 An introduction to Theano 880 An introduction to Lasagne 881 Implementing neural networks with nolearn 886 GPU optimization 890 When to use GPUs for computation 891 Running our code on a GPU 892 Setting up the environment 893 Application 896 Getting the data 896 Creating the neural network 897 Putting it all together 899

Chapter 12: Working with Big Data

903

Big data 904 Application scenario and goals 905 MapReduce 907 Intuition 907 A word count example 909 Hadoop MapReduce 910 Application 911 Getting the data 912 Naive Bayes prediction 914 The mrjob package Extracting the blog posts Training Naive Bayes Putting it all together Training on Amazon's EMR infrastructure

Chapter 13: Next Steps…

Chapter 1 – Getting Started with Data Mining Scikit-learn tutorials Extending the IPython Notebook [ xiii ]

914 914 917 920 925

931 931 931 932

Table of Contents

Chapter 2 – Classifying with scikit-learn Estimators 932 More complex pipelines 932 Comparing classifiers 932 Chapter 3: Predicting Sports Winners with Decision Trees 933 More on pandas 933 Chapter 4 – Recommending Movies Using Affinity Analysis 933 The Eclat algorithm 933 Chapter 5 – Extracting Features with Transformers 934 Vowpal Wabbit 934 Chapter 6 – Social Media Insight Using Naive Bayes 934 Natural language processing and part-of-speech tagging 934 Chapter 7 – Discovering Accounts to Follow Using Graph Mining 934 More complex algorithms 934 Chapter 8 – Beating CAPTCHAs with Neural Networks 935 Deeper networks 935 Reinforcement learning 935 Chapter 9 – Authorship Attribution 935 Local n-grams 935 Chapter 10 – Clustering News Articles 936 Real-time clusterings 936 Chapter 11 – Classifying Objects in Images Using Deep Learning 936 Keras and Pylearn2 936 Mahotas 936 Chapter 12 – Working with Big Data 937 Courses on Hadoop 937 Pydoop 937 Recommendation engine 937 More resources 937

Course Module 4: Machine Learning Chapter 1: Giving Computers the Ability to Learn from Data How to transform data into knowledge The three different types of machine learning Making predictions about the future with supervised learning Classification for predicting class labels Regression for predicting continuous outcomes

Solving interactive problems with reinforcement learning

[ xiv ]

943 944 944 945

945 946

948

Table of Contents

Discovering hidden structures with unsupervised learning

948

An introduction to the basic terminology and notations A roadmap for building machine learning systems Preprocessing – getting data into shape Training and selecting a predictive model Evaluating models and predicting unseen data instances Using Python for machine learning

951 953 954 954 955 955

Finding subgroups with clustering Dimensionality reduction for data compression

Chapter 2: Training Machine Learning Algorithms for Classification

Artificial neurons – a brief glimpse into the early history of machine learning Implementing a perceptron learning algorithm in Python Training a perceptron model on the Iris dataset Adaptive linear neurons and the convergence of learning Minimizing cost functions with gradient descent Implementing an Adaptive Linear Neuron in Python Large scale machine learning and stochastic gradient descent

Chapter 3: A Tour of Machine Learning Classifiers Using scikit-learn

Choosing a classification algorithm First steps with scikit-learn Training a perceptron via scikit-learn Modeling class probabilities via logistic regression Logistic regression intuition and conditional probabilities Learning the weights of the logistic cost function Training a logistic regression model with scikit-learn Tackling overfitting via regularization Maximum margin classification with support vector machines Maximum margin intuition Dealing with the nonlinearly separable case using slack variables Alternative implementations in scikit-learn Solving nonlinear problems using a kernel SVM Using the kernel trick to find separating hyperplanes in higher dimensional space Decision tree learning Maximizing information gain – getting the most bang for the buck Building a decision tree Combining weak to strong learners via random forests K-nearest neighbors – a lazy learning algorithm [ xv ]

949 949

959 960 966 969 975 976 978 984

991

991 992 992 998 998 1002 1004 1007 1011 1012 1013 1016 1017 1019 1022 1024 1030 1032 1034

Table of Contents

Chapter 4: Building Good Training Sets – Data Preprocessing

1041

Chapter 5: Compressing Data via Dimensionality Reduction

1071

Dealing with missing data Eliminating samples or features with missing values Imputing missing values Understanding the scikit-learn estimator API Handling categorical data Mapping ordinal features Encoding class labels Performing one-hot encoding on nominal features Partitioning a dataset in training and test sets Bringing features onto the same scale Selecting meaningful features Sparse solutions with L1 regularization Sequential feature selection algorithms Assessing feature importance with random forests

Unsupervised dimensionality reduction via principal component analysis Total and explained variance Feature transformation Principal component analysis in scikit-learn Supervised data compression via linear discriminant analysis Computing the scatter matrices Selecting linear discriminants for the new feature subspace Projecting samples onto the new feature space LDA via scikit-learn Using kernel principal component analysis for nonlinear mappings Kernel functions and the kernel trick Implementing a kernel principal component analysis in Python Example 1 – separating half-moon shapes Example 2 – separating concentric circles

Projecting new data points Kernel principal component analysis in scikit-learn

Chapter 6: Learning Best Practices for Model Evaluation and Hyperparameter Tuning Streamlining workflows with pipelines Loading the Breast Cancer Wisconsin dataset Combining transformers and estimators in a pipeline Using k-fold cross-validation to assess model performance The holdout method [ xvi ]

1041 1043 1044 1044 1046 1047 1047 1049 1050 1052 1054 1055 1061 1067

1072 1073 1077 1079 1082 1084 1087 1089 1090 1092 1092 1098

1099 1103

1106 1110

1113 1113 1114 1115 1117 1117

Table of Contents

K-fold cross-validation Debugging algorithms with learning and validation curves Diagnosing bias and variance problems with learning curves Addressing overfitting and underfitting with validation curves Fine-tuning machine learning models via grid search Tuning hyperparameters via grid search Algorithm selection with nested cross-validation Looking at different performance evaluation metrics Reading a confusion matrix Optimizing the precision and recall of a classification model Plotting a receiver operating characteristic The scoring metrics for multiclass classification

1119 1123 1124 1127 1129 1130 1131 1133 1134 1135 1137 1141

Chapter 7: Combining Different Models for Ensemble Learning 1145 Learning with ensembles Implementing a simple majority vote classifier Combining different algorithms for classification with majority vote Evaluating and tuning the ensemble classifier Bagging – building an ensemble of classifiers from bootstrap samples Leveraging weak learners via adaptive boosting

Chapter 8: Predicting Continuous Target Variables with Regression Analy