430 96 23MB
English Pages 1255 Year 2016
!
R
20 16
EW
N
FO
Python Real-World Data Science
CURATED COURSE
Python Real-World Data Science A course in four modules
Unleash the power of Python and its robust data science capabilities with your Course Guide Ankita Thakur
Learn to use powerful Python libraries for effective data processing and analysis
To contact your Course Guide Email: [email protected]
BIRMINGHAM - MUMBAI
Meet Your Course Guide
Hello and welcome to this Data Science with Python course. You now have a clear pathway from learning Python core features right through to getting acquainted with the concepts and techniques of the data science field—all using Python! This course has been planned and created for you by me Ankita Thakur – I am your Course Guide, and I am here to help you have a great journey along the pathways of learning that I have planned for you. I've developed and created this course for you and you'll be seeing me through the whole journey, offering you my thoughts and ideas behind what you're going to learn next and why I recommend each step. I'll provide tests and quizzes to help you reflect on your learning, and code challenges that will be pitched just right for you through the course. If you have any questions along the way, you can reach out to me over e-mail or telephone and I'll make sure you get everything from the course that we've planned – for you to start your career in the field of data science. Details of how to contact me are included on the first page of this course.
What's so cool about Data Science?
What is Data Science and why is there so much of buzz about this in the world? Is it of great importance? Well, the following sentence will answer all such questions: "This hot new field promises to revolutionize industries from business to government, health care to academia." – The New York Times The world is generating data at an increasing pace. Consumers, sensors, or scientific experiments emit data points every day. In finance, business, administration, and the natural or social sciences, working with data can make up a significant part of the job. Being able to efficiently work with small or large datasets has become a valuable skill. Also, we live in a world of connected things where tons of data is generated and it is humanly impossible to analyze all the incoming data and make decisions. Human decisions are increasingly replaced by decisions made by computers. Thanks to the field of Data Science!
Data science has penetrated deeply in our connected world and there is a growing demand in the market for people who not only understand data science algorithms thoroughly, but are also capable of programming these algorithms. A field that is at the intersection of many fields, including data mining, machine learning, and statistics, to name a few. This puts an immense burden on all levels of data scientists; from the one who is aspiring to become a data scientist and those who are currently practitioners in this field. Treating these algorithms as a black box and using them in decision-making systems will lead to counterproductive results. With tons of algorithms and innumerable problems out there, it requires a good grasp of the underlying algorithms in order to choose the best one for any given problem. Python as a programming language has evolved over the years and today, it is the number one choice for a data scientist. Python has become the most popular programming language for data science because it allows us to forget about the tedious parts of programming and offers us an environment where we can quickly jot down our ideas and put concepts directly into action. It has been used in industry for a long time, but it has been popular among researchers as well. In contrast to more specialized applications and environments, Python is not only about data analysis. The list of industrial-strength libraries for many general computing tasks is long, which makes working with data in Python even more compelling. Whether your data lives inside SQL or NoSQL databases or is out there on the Web and must be crawled or scraped first, the Python community has already developed packages for many of those tasks.
Course Structure
Frankly speaking, it's a wise decision to know the nitty-gritty of Python as it's a trending language. I'm sure you'll gain lot of knowledge through this course and be able to implement all those in practice. However, I want to highlight that the road ahead may be bumpy on occasions, and some topics may be more challenging than others, but I hope that you will embrace this opportunity and focus on the reward. Remember that we are on this journey together, and throughout this course, we will add many powerful techniques to your arsenal that will help us solve even the toughest problems the data-driven way.
I've created this learning path for you that consist of four models. Each of these modules are a mini-course in their own way, and as you complete each one, you'll have gained key skills and be ready for the material in the next module.
So let's now look at the pathway these modules create—basically all the topics that will be exploring in this learning journey.
Course Journey
We start the course with our very first module, Python Fundamentals, to help you get familiar with Python. Installing Python correctly is equal to half job done. This module starts with the installation of Python, IPython, and all the necessary packages. Then, we'll see the fundamentals of object-oriented programming because Python itself is an object-oriented programming language. Finally, we'll make friends with some of the core concepts of Python—how to get Python programming basics nailed down.
Then we'll move towards the analysis part. The second module, Data Analysis, will get you started with Python data analysis in a practical and example-driven way. You'll see how we can use Python libraries for effective data processing and analysis. So, if you want to to get started with basic data processing tasks or time series, then you can find lot of hands-on knowledge in the examples of this module.
The third module, Data Mining, is designed in a way that you have a good understanding of the basics, some best practices to jump into solving problems with data mining, and some pointers on the next steps you can take. Now, you can harness the power of Python to analyze data and create insightful predictive models.
Finally, we'll move towards exploring more advanced topics. Sometimes an analysis task is too complex to program by hand. Machine learning is a modern technique that enables computers to discover patterns and draw conclusions for themselves. The aim of our fourth module, Machine Learning, is to provide you with a module where we'll discuss the necessary details regarding machine learning concepts, offering intuitive yet informative explanations on how machine learning algorithms work, how to use them, and most importantly, how to avoid the common pitfalls. So, if you want to become a machine-learning practitioner, a better problem solver, or maybe even consider a career in machine learning research, I'm sure there is lot for you in this module!
The Course Roadmap and Timeline
Here's a view of the entire course plan before we begin. This grid gives you a topic overview of the whole course and its modules, so you can see how we will move through particular phases of learning to use Python, what skills you'll be learning along the way, and what you can do with those skills at each point. I also offer you an estimate of the time you might want to take for each module, although a lot depends on your learning style how much you're able to give the course each week!
Table of Contents Course Module 1: Python Fundamentals Chapter 1: Introduction and First Steps – Take a Deep Breath
5
A proper introduction 6 Enter the Python 8 About Python 9 Portability 9 Coherence 9 Developer productivity 9 An extensive library 10 Software quality 10 Software integration 10 Satisfaction and enjoyment 10 What are the drawbacks? 11 Who is using Python today? 11 Setting up the environment 11 Python 2 versus Python 3 – the great debate 12 What you need for this course 13 Installing Python 14 Installing IPython 14 Installing additional packages 16 How you can run a Python program 17 Running Python scripts 18 Running the Python interactive shell 18 Running Python as a service 20 Running Python as a GUI application 20 How is Python code organized 21 How do we use modules and packages 22 [i]
Table of Contents
Python's execution model 25 Names and namespaces 25 Scopes 27 Guidelines on how to write good code 30 The Python culture 31 A note on the IDEs 33
Chapter 2: Object-oriented Design
35
Chapter 3: Objects in Python
63
Introducing object-oriented 35 Objects and classes 37 Specifying attributes and behaviors 39 Data describes objects 40 Behaviors are actions 41 Hiding details and creating the public interface 43 Composition 45 Inheritance 48 Inheritance provides abstraction 50 Multiple inheritance 51 Case study 52 Creating Python classes Adding attributes Making it do something
63 65 66
Talking to yourself More arguments
66 67
Initializing the object Explaining yourself Modules and packages Organizing the modules
69 71 73 76
Organizing module contents Who can access my data? Third-party libraries Case study
79 82 84 85
Absolute imports Relative imports
76 77
Chapter 4: When Objects Are Alike Basic inheritance Extending built-ins Overriding and super
[ ii ]
97
97 100 101
Table of Contents
Multiple inheritance 103 The diamond problem 105 Different sets of arguments 110 Polymorphism 113 Abstract base classes 116 Using an abstract base class 116 Creating an abstract base class 117 Demystifying the magic 119 Case study 120
Chapter 5: Expecting the Unexpected
137
Chapter 6: When to Use Object-oriented Programming
167
Chapter 7: Python Data Structures
199
Raising exceptions Raising an exception The effects of an exception Handling exceptions The exception hierarchy Defining our own exceptions Case study
Treat objects as objects Adding behavior to class data with properties Properties in detail Decorators – another way to create properties Deciding when to use properties Manager objects Removing duplicate code In practice Case study
138 139 141 142 148 149 154 167 171 174 176 178 180 182 184 187
Empty objects 199 Tuples and named tuples 201 Named tuples 203 Dictionaries 204 Dictionary use cases 208 Using defaultdict 208 Counter 210
Lists 211 Sorting lists 213 Sets 217 Extending built-ins 221 [ iii ]
Table of Contents
Queues 226 FIFO queues 227 LIFO queues 229 Priority queues 230 Case study 232
Chapter 8: Python Object-oriented Shortcuts
241
Chapter 9: Strings and Serialization
273
Python built-in functions 241 The len() function 242 Reversed 242 Enumerate 244 File I/O 245 Placing it in context 247 An alternative to method overloading 249 Default arguments 250 Variable argument lists 252 Unpacking arguments 256 Functions are objects too 257 Using functions as attributes 261 Callable objects 262 Case study 263 Strings 273 String manipulation 274 String formatting 276 Escaping braces Keyword arguments Container lookups Object lookups Making it look right
277 278 279 280 281
Strings are Unicode
283
Mutable byte strings Regular expressions Matching patterns
287 288 289
Converting bytes to text Converting text to bytes
284 285
Matching a selection of characters Escaping characters Matching multiple characters Grouping patterns together
Getting information from regular expressions Making repeated regular expressions efficient
[ iv ]
290 291 292 293
294
296
Table of Contents
Serializing objects Customizing pickles Serializing web objects Case study
296 298 301 304
Chapter 10: The Iterator Pattern
313
Chapter 11: Python Design Patterns I
345
Chapter 12: Python Design Patterns II
375
Design patterns in brief 313 Iterators 314 The iterator protocol 315 Comprehensions 317 List comprehensions 317 Set and dictionary comprehensions 320 Generator expressions 321 Generators 323 Yield items from another iterable 325 Coroutines 328 Back to log parsing 331 Closing coroutines and throwing exceptions 333 The relationship between coroutines, generators, and functions 334 Case study 335 The decorator pattern A decorator example Decorators in Python The observer pattern An observer example The strategy pattern A strategy example Strategy in Python The state pattern A state example State versus strategy State transition as coroutines The singleton pattern Singleton implementation The template pattern A template example The adapter pattern The facade pattern
[v]
345 346 349 351 352 354 355 357 357 358 364 364 364 365 369 369 375 379
Table of Contents
The flyweight pattern The command pattern The abstract factory pattern The composite pattern
381 385 390 395
Chapter 13: Testing Object-oriented Programs
403
Chapter 14: Concurrency
441
Why test? Test-driven development Unit testing Assertion methods Reducing boilerplate and cleaning up Organizing and running tests Ignoring broken tests Testing with py.test One way to do setup and cleanup A completely different way to set up variables Skipping tests with py.test Imitating expensive objects How much testing is enough? Case study Implementing it
403 405 406 408 410 411 412 414 416 419 423 424 428 431 432
Threads 442 The many problems with threads 445 Shared memory The global interpreter lock
445 446
Thread overhead 447 Multiprocessing 447 Multiprocessing pools 449 Queues 452 The problems with multiprocessing 454 Futures 454 AsyncIO 457 AsyncIO in action 458 Reading an AsyncIO future 459 AsyncIO for networking 460 Using executors to wrap blocking code 463 Streams 465 Executors 465
Case study
466
[ vi ]
Table of Contents
Course Module 2: Data Analysis Chapter 1: Introducing Data Analysis and Libraries
479
Chapter 2: NumPy Arrays and Vectorized Computation
491
Chapter 3: Data Analysis with pandas
511
Data analysis and processing 480 An overview of the libraries in data analysis 483 Python libraries in data analysis 486 NumPy 486 pandas 487 Matplotlib 487 PyMongo 487 The scikit-learn library 488 NumPy arrays Data types Array creation Indexing and slicing Fancy indexing Numerical operations on arrays Array functions Data processing using arrays Loading and saving data Saving an array Loading an array Linear algebra with NumPy NumPy random numbers
492 492 494 496 497 498 499 501 503 503 504 504 506
An overview of the pandas package 511 The pandas data structure 512 Series 512 The DataFrame 514 The essential basic functionality 518 Reindexing and altering labels 518 Head and tail 519 Binary operations 520 Functional statistics 521 Function application 523 Sorting 524 Indexing and selecting data 526 Computational tools 528 Working with missing data 529 [ vii ]
Table of Contents
Advanced uses of pandas for data analysis Hierarchical indexing The Panel data
532 532 535
Chapter 4: Data Visualization
539
Chapter 5: Time Series
563
Chapter 6: Interacting with Databases
587
Chapter 7: Data Analysis Application Examples
607
The matplotlib API primer 540 Line properties 543 Figures and subplots 545 Exploring plot types 548 Scatter plots 548 Bar plots 549 Contour plots 550 Histogram plots 551 Legends and annotations 552 Plotting functions with pandas 556 Additional Python data visualization tools 559 Bokeh 559 MayaVi 560 Time series primer 563 Working with date and time objects 564 Resampling time series 572 Downsampling time series data 573 Upsampling time series data 575 Timedeltas 579 Time series plotting 580
Interacting with data in text format 587 Reading data from text format 587 Writing data to text format 592 Interacting with data in binary format 593 HDF5 594 Interacting with data in MongoDB 595 Interacting with data in Redis 600 The simple value 600 List 601 Set 602 Ordered set 603 Data munging Cleaning data
[ viii ]
608 610
Table of Contents
Filtering 613 Merging data 616 Reshaping data 620 Data aggregation 621 Grouping data 624
Course Module 3: Data Mining Chapter 1: Getting Started with Data Mining
633
Chapter 2: Classifying with scikit-learn Estimators
653
Chapter 3: Predicting Sports Winners with Decision Trees
671
Introducing data mining A simple affinity analysis example What is affinity analysis? Product recommendations Loading the dataset with NumPy Implementing a simple ranking of rules Ranking to find the best rules A simple classification example What is classification? Loading and preparing the dataset Implementing the OneR algorithm Testing the algorithm
634 635 635 636 636 638 641 644 644 645 646 649
scikit-learn estimators 653 Nearest neighbors 654 Distance metrics 655 Loading the dataset 657 Moving towards a standard workflow 659 Running the algorithm 660 Setting parameters 661 Preprocessing using pipelines 664 An example 665 Standard preprocessing 665 Putting it all together 666 Pipelines 666 Loading the dataset Collecting the data Using pandas to load the dataset
[ ix ]
671 672 673
Table of Contents
Cleaning up the dataset Extracting new features Decision trees Parameters in decision trees Using decision trees Sports outcome prediction Putting it all together Random forests How do ensembles work? Parameters in Random forests Applying Random forests Engineering new features
674 675 677 678 679 680 680 684 685 686 686 688
Chapter 4: Recommending Movies Using Affinity Analysis
691
Chapter 5: Extracting Features with Transformers
711
Affinity analysis 691 Algorithms for affinity analysis 692 Choosing parameters 693 The movie recommendation problem 694 Obtaining the dataset 694 Loading with pandas 694 Sparse data formats 695 The Apriori implementation 696 The Apriori algorithm 698 Implementation 699 Extracting association rules 702 Evaluation 706 Feature extraction Representing reality in models Common feature patterns Creating good features Feature selection Selecting the best individual features Feature creation Creating your own transformer The transformer API Implementation details Unit testing Putting it all together
[x]
712 712 714 717 718 720 724 726 727 728 729 731
Table of Contents
Chapter 6: Social Media Insight Using Naive Bayes
733
Disambiguation 734 Downloading data from a social network 735 Loading and classifying the dataset 737 Creating a replicable dataset from Twitter 742 Text transformers 746 Bag-of-words 747 N-grams 748 Other features 749 Naive Bayes 749 Bayes' theorem 750 Naive Bayes algorithm 751 How it works 752 Application 753 Extracting word counts 754 Converting dictionaries to a matrix 755 Training the Naive Bayes classifier 755 Putting it all together 756 Evaluation using the F1-score 757 Getting useful features from models 758
Chapter 7: Discovering Accounts to Follow Using Graph Mining 763 Loading the dataset Classifying with an existing model Getting follower information from Twitter Building the network Creating a graph Creating a similarity graph Finding subgraphs Connected components Optimizing criteria
Chapter 8: Beating CAPTCHAs with Neural Networks Artificial neural networks An introduction to neural networks Creating the dataset Drawing basic CAPTCHAs Splitting the image into individual letters Creating a training dataset Adjusting our training dataset to our methodology
[ xi ]
764 765 767 770 773 775 779 780 783
791 792 793 795 796 797 799 801
Table of Contents
Training and classifying Back propagation Predicting words Improving accuracy using a dictionary Ranking mechanisms for words Putting it all together
802 804 805 810 810 811
Chapter 9: Authorship Attribution
815
Chapter 10: Clustering News Articles
841
Attributing documents to authors 816 Applications and use cases 816 Attributing authorship 817 Getting the data 819 Function words 822 Counting function words 823 Classifying with function words 824 Support vector machines 825 Classifying with SVMs 827 Kernels 827 Character n-grams 828 Extracting character n-grams 829 Using the Enron dataset 830 Accessing the Enron dataset 831 Creating a dataset loader 831 Putting it all together 836 Evaluation 837 Obtaining news articles Using a Web API to get data Reddit as a data source Getting the data Extracting text from arbitrary websites Finding the stories in arbitrary websites Putting it all together Grouping news articles The k-means algorithm Evaluating the results Extracting topic information from clusters Using clustering algorithms as transformers Clustering ensembles Evidence accumulation How it works [ xii ]
841 842 845 846 848 849 850 852 853 856 858 859 860 860 864
Table of Contents
Implementation 865 Online learning 866 An introduction to online learning 866 Implementation 867
Chapter 11: Classifying Objects in Images Using Deep Learning 873 Object classification 874 Application scenario and goals 874 Use cases 877 Deep neural networks 878 Intuition 879 Implementation 879 An introduction to Theano 880 An introduction to Lasagne 881 Implementing neural networks with nolearn 886 GPU optimization 890 When to use GPUs for computation 891 Running our code on a GPU 892 Setting up the environment 893 Application 896 Getting the data 896 Creating the neural network 897 Putting it all together 899
Chapter 12: Working with Big Data
903
Big data 904 Application scenario and goals 905 MapReduce 907 Intuition 907 A word count example 909 Hadoop MapReduce 910 Application 911 Getting the data 912 Naive Bayes prediction 914 The mrjob package Extracting the blog posts Training Naive Bayes Putting it all together Training on Amazon's EMR infrastructure
Chapter 13: Next Steps…
Chapter 1 – Getting Started with Data Mining Scikit-learn tutorials Extending the IPython Notebook [ xiii ]
914 914 917 920 925
931 931 931 932
Table of Contents
Chapter 2 – Classifying with scikit-learn Estimators 932 More complex pipelines 932 Comparing classifiers 932 Chapter 3: Predicting Sports Winners with Decision Trees 933 More on pandas 933 Chapter 4 – Recommending Movies Using Affinity Analysis 933 The Eclat algorithm 933 Chapter 5 – Extracting Features with Transformers 934 Vowpal Wabbit 934 Chapter 6 – Social Media Insight Using Naive Bayes 934 Natural language processing and part-of-speech tagging 934 Chapter 7 – Discovering Accounts to Follow Using Graph Mining 934 More complex algorithms 934 Chapter 8 – Beating CAPTCHAs with Neural Networks 935 Deeper networks 935 Reinforcement learning 935 Chapter 9 – Authorship Attribution 935 Local n-grams 935 Chapter 10 – Clustering News Articles 936 Real-time clusterings 936 Chapter 11 – Classifying Objects in Images Using Deep Learning 936 Keras and Pylearn2 936 Mahotas 936 Chapter 12 – Working with Big Data 937 Courses on Hadoop 937 Pydoop 937 Recommendation engine 937 More resources 937
Course Module 4: Machine Learning Chapter 1: Giving Computers the Ability to Learn from Data How to transform data into knowledge The three different types of machine learning Making predictions about the future with supervised learning Classification for predicting class labels Regression for predicting continuous outcomes
Solving interactive problems with reinforcement learning
[ xiv ]
943 944 944 945
945 946
948
Table of Contents
Discovering hidden structures with unsupervised learning
948
An introduction to the basic terminology and notations A roadmap for building machine learning systems Preprocessing – getting data into shape Training and selecting a predictive model Evaluating models and predicting unseen data instances Using Python for machine learning
951 953 954 954 955 955
Finding subgroups with clustering Dimensionality reduction for data compression
Chapter 2: Training Machine Learning Algorithms for Classification
Artificial neurons – a brief glimpse into the early history of machine learning Implementing a perceptron learning algorithm in Python Training a perceptron model on the Iris dataset Adaptive linear neurons and the convergence of learning Minimizing cost functions with gradient descent Implementing an Adaptive Linear Neuron in Python Large scale machine learning and stochastic gradient descent
Chapter 3: A Tour of Machine Learning Classifiers Using scikit-learn
Choosing a classification algorithm First steps with scikit-learn Training a perceptron via scikit-learn Modeling class probabilities via logistic regression Logistic regression intuition and conditional probabilities Learning the weights of the logistic cost function Training a logistic regression model with scikit-learn Tackling overfitting via regularization Maximum margin classification with support vector machines Maximum margin intuition Dealing with the nonlinearly separable case using slack variables Alternative implementations in scikit-learn Solving nonlinear problems using a kernel SVM Using the kernel trick to find separating hyperplanes in higher dimensional space Decision tree learning Maximizing information gain – getting the most bang for the buck Building a decision tree Combining weak to strong learners via random forests K-nearest neighbors – a lazy learning algorithm [ xv ]
949 949
959 960 966 969 975 976 978 984
991
991 992 992 998 998 1002 1004 1007 1011 1012 1013 1016 1017 1019 1022 1024 1030 1032 1034
Table of Contents
Chapter 4: Building Good Training Sets – Data Preprocessing
1041
Chapter 5: Compressing Data via Dimensionality Reduction
1071
Dealing with missing data Eliminating samples or features with missing values Imputing missing values Understanding the scikit-learn estimator API Handling categorical data Mapping ordinal features Encoding class labels Performing one-hot encoding on nominal features Partitioning a dataset in training and test sets Bringing features onto the same scale Selecting meaningful features Sparse solutions with L1 regularization Sequential feature selection algorithms Assessing feature importance with random forests
Unsupervised dimensionality reduction via principal component analysis Total and explained variance Feature transformation Principal component analysis in scikit-learn Supervised data compression via linear discriminant analysis Computing the scatter matrices Selecting linear discriminants for the new feature subspace Projecting samples onto the new feature space LDA via scikit-learn Using kernel principal component analysis for nonlinear mappings Kernel functions and the kernel trick Implementing a kernel principal component analysis in Python Example 1 – separating half-moon shapes Example 2 – separating concentric circles
Projecting new data points Kernel principal component analysis in scikit-learn
Chapter 6: Learning Best Practices for Model Evaluation and Hyperparameter Tuning Streamlining workflows with pipelines Loading the Breast Cancer Wisconsin dataset Combining transformers and estimators in a pipeline Using k-fold cross-validation to assess model performance The holdout method [ xvi ]
1041 1043 1044 1044 1046 1047 1047 1049 1050 1052 1054 1055 1061 1067
1072 1073 1077 1079 1082 1084 1087 1089 1090 1092 1092 1098
1099 1103
1106 1110
1113 1113 1114 1115 1117 1117
Table of Contents
K-fold cross-validation Debugging algorithms with learning and validation curves Diagnosing bias and variance problems with learning curves Addressing overfitting and underfitting with validation curves Fine-tuning machine learning models via grid search Tuning hyperparameters via grid search Algorithm selection with nested cross-validation Looking at different performance evaluation metrics Reading a confusion matrix Optimizing the precision and recall of a classification model Plotting a receiver operating characteristic The scoring metrics for multiclass classification
1119 1123 1124 1127 1129 1130 1131 1133 1134 1135 1137 1141
Chapter 7: Combining Different Models for Ensemble Learning 1145 Learning with ensembles Implementing a simple majority vote classifier Combining different algorithms for classification with majority vote Evaluating and tuning the ensemble classifier Bagging – building an ensemble of classifiers from bootstrap samples Leveraging weak learners via adaptive boosting
Chapter 8: Predicting Continuous Target Variables with Regression Analy