Python for Data Analysis: Data Wrangling with pandas, NumPy, and Jupyter [3 ed.] 109810403X, 9781098104030

Get the definitive handbook for manipulating, processing, cleaning, and crunching datasets in Python. Updated for Python

3,521 282 9MB

English Pages 580 [582] Year 2022

Table of contents :
Cover
Copyright
Table of Contents
Preface
Section 1. Conventions Used in This Book
Section 2. Using Code Examples
Section 3. O’Reilly Online Learning
Section 4. How to Contact Us
Section 5. Acknowledgments
In Memoriam: John D. Hunter (1968–2012)
Acknowledgments for the Third Edition (2022)
Acknowledgments for the Second Edition (2017)
Acknowledgments for the First Edition (2012)
Chapter 1. Preliminaries
1.1 What Is This Book About?
What Kinds of Data?
1.2 Why Python for Data Analysis?
Python as Glue
Solving the “Two-Language” Problem
Why Not Python?
1.3 Essential Python Libraries
NumPy
pandas
matplotlib
IPython and Jupyter
SciPy
scikit-learn
statsmodels
Other Packages
1.4 Installation and Setup
Miniconda on Windows
GNU/Linux
Miniconda on macOS
Installing Necessary Packages
Integrated Development Environments and Text Editors
1.5 Community and Conferences
1.6 Navigating This Book
Code Examples
Data for Examples
Import Conventions
Chapter 2. Python Language Basics, IPython, and Jupyter Notebooks
2.1 The Python Interpreter
2.2 IPython Basics
Running the IPython Shell
Running the Jupyter Notebook
Tab Completion
Introspection
2.3 Python Language Basics
Language Semantics
Scalar Types
Control Flow
2.4 Conclusion
Chapter 3. Built-In Data Structures, Functions, and Files
3.1 Data Structures and Sequences
Tuple
List
Dictionary
Set
Built-In Sequence Functions
List, Set, and Dictionary Comprehensions
3.2 Functions
Namespaces, Scope, and Local Functions
Returning Multiple Values
Functions Are Objects
Anonymous (Lambda) Functions
Generators
Errors and Exception Handling
3.3 Files and the Operating System
Bytes and Unicode with Files
3.4 Conclusion
Chapter 4. NumPy Basics: Arrays and Vectorized Computation
4.1 The NumPy ndarray: A Multidimensional Array Object
Creating ndarrays
Data Types for ndarrays
Arithmetic with NumPy Arrays
Basic Indexing and Slicing
Boolean Indexing
Fancy Indexing
Transposing Arrays and Swapping Axes
4.2 Pseudorandom Number Generation
4.3 Universal Functions: Fast Element-Wise Array Functions
4.4 Array-Oriented Programming with Arrays
Expressing Conditional Logic as Array Operations
Mathematical and Statistical Methods
Methods for Boolean Arrays
Sorting
Unique and Other Set Logic
4.5 File Input and Output with Arrays
4.6 Linear Algebra
4.7 Example: Random Walks
Simulating Many Random Walks at Once
4.8 Conclusion
Chapter 5. Getting Started with pandas
5.1 Introduction to pandas Data Structures
Series
DataFrame
Index Objects
5.2 Essential Functionality
Reindexing
Dropping Entries from an Axis
Indexing, Selection, and Filtering
Arithmetic and Data Alignment
Function Application and Mapping
Sorting and Ranking
Axis Indexes with Duplicate Labels
5.3 Summarizing and Computing Descriptive Statistics
Correlation and Covariance
Unique Values, Value Counts, and Membership
5.4 Conclusion
Chapter 6. Data Loading, Storage, and File Formats
6.1 Reading and Writing Data in Text Format
Reading Text Files in Pieces
Writing Data to Text Format
Working with Other Delimited Formats
JSON Data
XML and HTML: Web Scraping
6.2 Binary Data Formats
Reading Microsoft Excel Files
Using HDF5 Format
6.3 Interacting with Web APIs
6.4 Interacting with Databases
6.5 Conclusion
Chapter 7. Data Cleaning and Preparation
7.1 Handling Missing Data
Filtering Out Missing Data
Filling In Missing Data
7.2 Data Transformation
Removing Duplicates
Transforming Data Using a Function or Mapping
Replacing Values
Renaming Axis Indexes
Discretization and Binning
Detecting and Filtering Outliers
Permutation and Random Sampling
Computing Indicator/Dummy Variables
7.3 Extension Data Types
7.4 String Manipulation
Python Built-In String Object Methods
Regular Expressions
String Functions in pandas
7.5 Categorical Data
Background and Motivation
Categorical Extension Type in pandas
Computations with Categoricals
Categorical Methods
7.6 Conclusion
Chapter 8. Data Wrangling: Join, Combine, and Reshape
8.1 Hierarchical Indexing
Reordering and Sorting Levels
Summary Statistics by Level
Indexing with a DataFrame’s columns
8.2 Combining and Merging Datasets
Database-Style DataFrame Joins
Merging on Index
Concatenating Along an Axis
Combining Data with Overlap
8.3 Reshaping and Pivoting
Reshaping with Hierarchical Indexing
Pivoting “Long” to “Wide” Format
Pivoting “Wide” to “Long” Format
8.4 Conclusion
Chapter 9. Plotting and Visualization
9.1 A Brief matplotlib API Primer
Figures and Subplots
Colors, Markers, and Line Styles
Ticks, Labels, and Legends
Annotations and Drawing on a Subplot
Saving Plots to File
matplotlib Configuration
9.2 Plotting with pandas and seaborn
Line Plots
Bar Plots
Histograms and Density Plots
Scatter or Point Plots
Facet Grids and Categorical Data
9.3 Other Python Visualization Tools
9.4 Conclusion
Chapter 10. Data Aggregation and Group Operations
10.1 How to Think About Group Operations
Iterating over Groups
Selecting a Column or Subset of Columns
Grouping with Dictionaries and Series
Grouping with Functions
Grouping by Index Levels
10.2 Data Aggregation
Column-Wise and Multiple Function Application
Returning Aggregated Data Without Row Indexes
10.3 Apply: General split-apply-combine
Suppressing the Group Keys
Quantile and Bucket Analysis
Example: Filling Missing Values with Group-Specific Values
Example: Random Sampling and Permutation
Example: Group Weighted Average and Correlation
Example: Group-Wise Linear Regression
10.4 Group Transforms and “Unwrapped” GroupBys
10.5 Pivot Tables and Cross-Tabulation
Cross-Tabulations: Crosstab
10.6 Conclusion
Chapter 11. Time Series
11.1 Date and Time Data Types and Tools
Converting Between String and Datetime
11.2 Time Series Basics
Indexing, Selection, Subsetting
Time Series with Duplicate Indices
11.3 Date Ranges, Frequencies, and Shifting
Generating Date Ranges
Frequencies and Date Offsets
Shifting (Leading and Lagging) Data
11.4 Time Zone Handling
Time Zone Localization and Conversion
Operations with Time Zone-Aware Timestamp Objects
Operations Between Different Time Zones
11.5 Periods and Period Arithmetic
Period Frequency Conversion
Quarterly Period Frequencies
Converting Timestamps to Periods (and Back)
Creating a PeriodIndex from Arrays
11.6 Resampling and Frequency Conversion
Downsampling
Upsampling and Interpolation
Resampling with Periods
Grouped Time Resampling
11.7 Moving Window Functions
Exponentially Weighted Functions
Binary Moving Window Functions
User-Defined Moving Window Functions
11.8 Conclusion
Chapter 12. Introduction to Modeling Libraries in Python
12.1 Interfacing Between pandas and Model Code
12.2 Creating Model Descriptions with Patsy
Data Transformations in Patsy Formulas
Categorical Data and Patsy
12.3 Introduction to statsmodels
Estimating Linear Models
Estimating Time Series Processes
12.4 Introduction to scikit-learn
12.5 Conclusion
Chapter 13. Data Analysis Examples
13.1 Bitly Data from 1.USA.gov
Counting Time Zones in Pure Python
Counting Time Zones with pandas
13.2 MovieLens 1M Dataset
Measuring Rating Disagreement
13.3 US Baby Names 1880–2010
Analyzing Naming Trends
13.4 USDA Food Database
13.5 2012 Federal Election Commission Database
Donation Statistics by Occupation and Employer
Bucketing Donation Amounts
Donation Statistics by State
13.6 Conclusion
Appendix A. Advanced NumPy
A.1 ndarray Object Internals
NumPy Data Type Hierarchy
A.2 Advanced Array Manipulation
Reshaping Arrays
C Versus FORTRAN Order
Concatenating and Splitting Arrays
Repeating Elements: tile and repeat
Fancy Indexing Equivalents: take and put
A.3 Broadcasting
Broadcasting over Other Axes
Setting Array Values by Broadcasting
A.4 Advanced ufunc Usage
ufunc Instance Methods
Writing New ufuncs in Python
A.5 Structured and Record Arrays
Nested Data Types and Multidimensional Fields
Why Use Structured Arrays?
A.6 More About Sorting
Indirect Sorts: argsort and lexsort
Alternative Sort Algorithms
Partially Sorting Arrays
numpy.searchsorted: Finding Elements in a Sorted Array
A.7 Writing Fast NumPy Functions with Numba
Creating Custom numpy.ufunc Objects with Numba
A.8 Advanced Array Input and Output
Memory-Mapped Files
HDF5 and Other Array Storage Options
A.9 Performance Tips
The Importance of Contiguous Memory
Appendix B. More on the IPython System
B.1 Terminal Keyboard Shortcuts
B.2 About Magic Commands
The %run Command
Executing Code from the Clipboard
B.3 Using the Command History
Searching and Reusing the Command History
Input and Output Variables
B.4 Interacting with the Operating System
Shell Commands and Aliases
Directory Bookmark System
B.5 Software Development Tools
Interactive Debugger
Timing Code: %time and %timeit
Basic Profiling: %prun and %run -p
Profiling a Function Line by Line
B.6 Tips for Productive Code Development Using IPython
Reloading Module Dependencies
Code Design Tips
B.7 Advanced IPython Features
Profiles and Configuration
B.8 Conclusion
Index
About the Author
Colophon

Recommend Papers

Python for Data Analysis: Data Wrangling with Pandas, NumPy, and IPython 2nd Edition 9789352136414, 9352136411

Get complete instructions for manipulating, processing, cleaning, and crunching datasets in Python. Updated for Python 3

1,695 202 3MB Read more

Hands-On Data Analysis with Pandas Efficiently perform data collection, wrangling, analysis, and visualization using Python 9781789615326

113 73 17MB Read more

Mastering Data Analysis with Python: A Comprehensive Guide to NumPy, Pandas, and Matplotlib

Are you tired of feeling like you're stuck in a dead-end job with no room for growth or advancement? Are you ready

546 61 4MB Read more

Hands-On Data Analysis with Pandas: A Python data science handbook for data collection, wrangling, analysis, and visualization [2 ed.] 1800563450, 9781800563452

Get to grips with pandas - a versatile and high-performance library for manipulating, processing, cleaning, and crunchin

3,521 363 71MB Read more

Hands-On Data Analysis with Pandas: A Python data science handbook for data collection, wrangling, analysis, and visualization, 2nd Edition 9781800563452, 9781789955248, 9781838826048, 1800563450

Get to grips with pandas - a fast, versatile, and high-performance Python library for data discovery, data manipulation,

737 19 97MB Read more

DATA SCIENCE PYTHON FOR BEGINNERS: The Complete Guide on Data Manipulation Using Matplotlib, Numpy, And Pandas

"Data Science with Python for Beginners" is a comprehensive guidebook that introduces readers to the fundament

340 123 5MB Read more

Practical Data Science with Jupyter: Explore Data Cleaning, Pre-processing, Data Wrangling, Feature Engineering and Machine Learning using Python and Jupyter (English Edition) 9389898064, 9789389898064

Solve business problems with data-driven techniques and easy-to-follow Python examples Key FeaturesEssential coverage o

2,108 212 19MB Read more

Minimalist Data Wrangling with Python 9780645571912

Minimalist Data Wrangling with Python is envisaged as a student's first introduction to data science, providing a h

118 33 7MB Read more

Python Data Analytics: With Pandas, NumPy, and Matplotlib, 3rd Edition [3 ed.] 9781484295311, 9781484295328

Explore the latest Python tools and techniques to help you tackle the world of data acquisition and analysis. You'l

110 41 19MB Read more

Data Wrangling Using Pandas, SQL, and Java 9781683929048, 1683929047

This book is intended primarily for those who plan to become data scientists as wellas anyone who needs to perform data

99 68 5MB Read more

Python for Data Analysis: Data Wrangling with pandas, NumPy, and Jupyter [3 ed.]
109810403X, 9781098104030

Author / Uploaded
Wes McKinney

Commentary
Publisher's PDF

0 0 0
Like this paper and download? You can publish your own PDF file online for free in a few minutes! Sign Up

File loading please wait...

Citation preview

ird n Th itio Ed

Python

for Data Analysis

Data Wrangling with pandas, NumPy & Jupyter powered by

Wes McKinney

Python for Data Analysis Get the definitive handbook for manipulating, processing, cleaning, and crunching datasets in Python. Updated for Python 3.10 and pandas 1.4, the third edition of this handson guide is packed with practical case studies that show you how to solve a broad set of data analysis problems effectively. You’ll learn the latest versions of pandas, NumPy, and Jupyter in the process. Written by Wes McKinney, the creator of the Python pandas project, this book is a practical, modern introduction to data science tools in Python. It’s ideal for analysts new to Python and for Python programmers new to data science and scientific computing. Data files and related material are available on GitHub.

• Use the Jupyter notebook and the IPython shell for exploratory computing

• Learn basic and advanced features in NumPy • Get started with data analysis tools in the pandas library • Use flexible tools to load, clean, transform, merge, and reshape data

• Create informative visualizations with matplotlib • Apply the pandas groupBy facility to slice, dice, and summarize datasets

• Analyze and manipulate regular and irregular time series

“With this new edition, Wes has updated his book to ensure it remains the go-to resource for all things related to data analysis with Python and pandas. I cannot recommend this book highly enough.” —Paul Barry

Lecturer and author of O’Reilly’s Head First Python

Wes McKinney, cofounder and chief technology officer of Voltron Data, is an active member of the Python data community and an advocate for Python use in data analysis, finance, and statistical computing applications. A graduate of MIT, he’s also a member of the project management committees for the Apache Software Foundation’s Apache Arrow and Apache Parquet projects.

data

• Learn how to solve real-world data analysis problems with thorough, detailed examples

DATA

US $69.99 CAN $87.99 ISBN: 978-1-098-10403-0

56999

9

781098 104030

Twitter: @oreillymedia linkedin.com/company/oreilly-media youtube.com/oreillymedia

THIRD EDITION

Python for Data Analysis

Data Wrangling with pandas, NumPy, and Jupyter

Wes McKinney

Beijing

Boston Farnham Sebastopol

Tokyo

Python for Data Analysis by Wes McKinney Copyright © 2022 Wesley McKinney. All rights reserved. Printed in the United States of America. Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472. O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://oreilly.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected].

Acquisitions Editor: Jessica Haberman Development Editor: Angela Rufino Production Editor: Christopher Faucher Copyeditor: Sonia Saruba Proofreader: Piper Editorial Consulting, LLC October 2012: October 2017: August 2022:

Indexer: Sue Klefstad Interior Designer: David Futato Cover Designer: Karen Montgomery Illustrator: Kate Dullea

First Edition Second Edition Third Edition

Revision History for the Third Edition 2022-08-12: First Release See https://www.oreilly.com/catalog/errata.csp?isbn=0636920519829 for release details. The O’Reilly logo is a registered trademark of O’Reilly Media, Inc. Python for Data Analysis, the cover image, and related trade dress are trademarks of O’Reilly Media, Inc. While the publisher and the author have used good faith efforts to ensure that the information and instructions contained in this work are accurate, the publisher and the author disclaim all responsibility for errors or omissions, including without limitation responsibility for damages resulting from the use of or reliance on this work. Use of the information and instructions contained in this work is at your own risk. If any code samples or other technology this work contains or describes is subject to open source licenses or the intellectual property rights of others, it is your responsibility to ensure that your use thereof complies with such licenses and/or rights.

978-1-098-10403-0 [LSI]

Table of Contents

Preface. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi 1. Preliminaries. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 What Is This Book About? What Kinds of Data? 1.2 Why Python for Data Analysis? Python as Glue Solving the “Two-Language” Problem Why Not Python? 1.3 Essential Python Libraries NumPy pandas matplotlib IPython and Jupyter SciPy scikit-learn statsmodels Other Packages 1.4 Installation and Setup Miniconda on Windows GNU/Linux Miniconda on macOS Installing Necessary Packages Integrated Development Environments and Text Editors 1.5 Community and Conferences 1.6 Navigating This Book Code Examples

1 1 2 3 3 3 4 4 5 6 6 7 8 8 9 9 9 10 11 11 12 13 14 15

iii

Data for Examples Import Conventions

15 16

2. Python Language Basics, IPython, and Jupyter Notebooks. . . . . . . . . . . . . . . . . . . . . . . . 17 2.1 The Python Interpreter 2.2 IPython Basics Running the IPython Shell Running the Jupyter Notebook Tab Completion Introspection 2.3 Python Language Basics Language Semantics Scalar Types Control Flow 2.4 Conclusion

18 19 19 20 23 25 26 26 34 42 45

3. Built-In Data Structures, Functions, and Files. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47 3.1 Data Structures and Sequences Tuple List Dictionary Set Built-In Sequence Functions List, Set, and Dictionary Comprehensions 3.2 Functions Namespaces, Scope, and Local Functions Returning Multiple Values Functions Are Objects Anonymous (Lambda) Functions Generators Errors and Exception Handling 3.3 Files and the Operating System Bytes and Unicode with Files 3.4 Conclusion

47 47 51 55 59 62 63 65 67 68 69 70 71 74 76 80 82

4. NumPy Basics: Arrays and Vectorized Computation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83 4.1 The NumPy ndarray: A Multidimensional Array Object Creating ndarrays Data Types for ndarrays Arithmetic with NumPy Arrays Basic Indexing and Slicing

iv

|

Table of Contents

85 86 88 91 92

Boolean Indexing Fancy Indexing Transposing Arrays and Swapping Axes 4.2 Pseudorandom Number Generation 4.3 Universal Functions: Fast Element-Wise Array Functions 4.4 Array-Oriented Programming with Arrays Expressing Conditional Logic as Array Operations Mathematical and Statistical Methods Methods for Boolean Arrays Sorting Unique and Other Set Logic 4.5 File Input and Output with Arrays 4.6 Linear Algebra 4.7 Example: Random Walks Simulating Many Random Walks at Once 4.8 Conclusion

97 100 102 103 105 108 110 111 113 114 115 116 116 118 120 121

5. Getting Started with pandas. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 5.1 Introduction to pandas Data Structures Series DataFrame Index Objects 5.2 Essential Functionality Reindexing Dropping Entries from an Axis Indexing, Selection, and Filtering Arithmetic and Data Alignment Function Application and Mapping Sorting and Ranking Axis Indexes with Duplicate Labels 5.3 Summarizing and Computing Descriptive Statistics Correlation and Covariance Unique Values, Value Counts, and Membership 5.4 Conclusion

124 124 129 136 138 138 141 142 152 158 160 164 165 168 170 173

6. Data Loading, Storage, and File Formats. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175 6.1 Reading and Writing Data in Text Format Reading Text Files in Pieces Writing Data to Text Format Working with Other Delimited Formats JSON Data

175 182 184 185 187

Table of Contents

|

v

XML and HTML: Web Scraping 6.2 Binary Data Formats Reading Microsoft Excel Files Using HDF5 Format 6.3 Interacting with Web APIs 6.4 Interacting with Databases 6.5 Conclusion

189 193 194 195 197 199 201

7. Data Cleaning and Preparation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203 7.1 Handling Missing Data Filtering Out Missing Data Filling In Missing Data 7.2 Data Transformation Removing Duplicates Transforming Data Using a Function or Mapping Replacing Values Renaming Axis Indexes Discretization and Binning Detecting and Filtering Outliers Permutation and Random Sampling Computing Indicator/Dummy Variables 7.3 Extension Data Types 7.4 String Manipulation Python Built-In String Object Methods Regular Expressions String Functions in pandas 7.5 Categorical Data Background and Motivation Categorical Extension Type in pandas Computations with Categoricals Categorical Methods 7.6 Conclusion

203 205 207 209 209 211 212 214 215 217 219 221 224 227 227 229 232 235 236 237 240 242 245

8. Data Wrangling: Join, Combine, and Reshape. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247 8.1 Hierarchical Indexing Reordering and Sorting Levels Summary Statistics by Level Indexing with a DataFrame’s columns 8.2 Combining and Merging Datasets Database-Style DataFrame Joins Merging on Index

vi

|

Table of Contents

247 250 251 252 253 254 259

Concatenating Along an Axis Combining Data with Overlap 8.3 Reshaping and Pivoting Reshaping with Hierarchical Indexing Pivoting “Long” to “Wide” Format Pivoting “Wide” to “Long” Format 8.4 Conclusion

263 268 270 270 273 277 279

9. Plotting and Visualization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 281 9.1 A Brief matplotlib API Primer Figures and Subplots Colors, Markers, and Line Styles Ticks, Labels, and Legends Annotations and Drawing on a Subplot Saving Plots to File matplotlib Configuration 9.2 Plotting with pandas and seaborn Line Plots Bar Plots Histograms and Density Plots Scatter or Point Plots Facet Grids and Categorical Data 9.3 Other Python Visualization Tools 9.4 Conclusion

282 283 288 290 294 296 297 298 298 301 309 311 314 317 317

10. Data Aggregation and Group Operations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 319 10.1 How to Think About Group Operations Iterating over Groups Selecting a Column or Subset of Columns Grouping with Dictionaries and Series Grouping with Functions Grouping by Index Levels 10.2 Data Aggregation Column-Wise and Multiple Function Application Returning Aggregated Data Without Row Indexes 10.3 Apply: General split-apply-combine Suppressing the Group Keys Quantile and Bucket Analysis Example: Filling Missing Values with Group-Specific Values Example: Random Sampling and Permutation Example: Group Weighted Average and Correlation

320 324 326 327 328 328 329 331 335 335 338 338 340 343 344

Table of Contents

|

vii

Example: Group-Wise Linear Regression 10.4 Group Transforms and “Unwrapped” GroupBys 10.5 Pivot Tables and Cross-Tabulation Cross-Tabulations: Crosstab 10.6 Conclusion

347 347 351 354 355

11. Time Series. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357 11.1 Date and Time Data Types and Tools Converting Between String and Datetime 11.2 Time Series Basics Indexing, Selection, Subsetting Time Series with Duplicate Indices 11.3 Date Ranges, Frequencies, and Shifting Generating Date Ranges Frequencies and Date Offsets Shifting (Leading and Lagging) Data 11.4 Time Zone Handling Time Zone Localization and Conversion Operations with Time Zone-Aware Timestamp Objects Operations Between Different Time Zones 11.5 Periods and Period Arithmetic Period Frequency Conversion Quarterly Period Frequencies Converting Timestamps to Periods (and Back) Creating a PeriodIndex from Arrays 11.6 Resampling and Frequency Conversion Downsampling Upsampling and Interpolation Resampling with Periods Grouped Time Resampling 11.7 Moving Window Functions Exponentially Weighted Functions Binary Moving Window Functions User-Defined Moving Window Functions 11.8 Conclusion

358 359 361 363 365 366 367 370 371 374 375 377 378 379 380 382 384 385 387 388 391 392 394 396 399 401 402 403

12. Introduction to Modeling Libraries in Python. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405 12.1 Interfacing Between pandas and Model Code 12.2 Creating Model Descriptions with Patsy Data Transformations in Patsy Formulas Categorical Data and Patsy

viii

| Table of Contents

405 408 410 412

12.3 Introduction to statsmodels Estimating Linear Models Estimating Time Series Processes 12.4 Introduction to scikit-learn 12.5 Conclusion

415 415 419 420 423

13. Data Analysis Examples. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 425 13.1 Bitly Data from 1.USA.gov Counting Time Zones in Pure Python Counting Time Zones with pandas 13.2 MovieLens 1M Dataset Measuring Rating Disagreement 13.3 US Baby Names 1880–2010 Analyzing Naming Trends 13.4 USDA Food Database 13.5 2012 Federal Election Commission Database Donation Statistics by Occupation and Employer Bucketing Donation Amounts Donation Statistics by State 13.6 Conclusion

425 426 428 435 439 443 448 457 463 466 469 471 472

A. Advanced NumPy. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 473 A.1 ndarray Object Internals NumPy Data Type Hierarchy A.2 Advanced Array Manipulation Reshaping Arrays C Versus FORTRAN Order Concatenating and Splitting Arrays Repeating Elements: tile and repeat Fancy Indexing Equivalents: take and put A.3 Broadcasting Broadcasting over Other Axes Setting Array Values by Broadcasting A.4 Advanced ufunc Usage ufunc Instance Methods Writing New ufuncs in Python A.5 Structured and Record Arrays Nested Data Types and Multidimensional Fields Why Use Structured Arrays? A.6 More About Sorting Indirect Sorts: argsort and lexsort

473 474 476 476 478 479 481 483 484 487 489 490 490 493 493 494 495 495 497

Table of Contents

|

ix

Alternative Sort Algorithms Partially Sorting Arrays numpy.searchsorted: Finding Elements in a Sorted Array A.7 Writing Fast NumPy Functions with Numba Creating Custom numpy.ufunc Objects with Numba A.8 Advanced Array Input and Output Memory-Mapped Files HDF5 and Other Array Storage Options A.9 Performance Tips The Importance of Contiguous Memory

498 499 500 501 502 503 503 504 505 505

B. More on the IPython System. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 509 B.1 Terminal Keyboard Shortcuts B.2 About Magic Commands The %run Command Executing Code from the Clipboard B.3 Using the Command History Searching and Reusing the Command History Input and Output Variables B.4 Interacting with the Operating System Shell Commands and Aliases Directory Bookmark System B.5 Software Development Tools Interactive Debugger Timing Code: %time and %timeit Basic Profiling: %prun and %run -p Profiling a Function Line by Line B.6 Tips for Productive Code Development Using IPython Reloading Module Dependencies Code Design Tips B.7 Advanced IPython Features Profiles and Configuration B.8 Conclusion

509 510 512 513 514 514 515 516 517 518 519 519 523 525 527 529 529 530 532 532 533

Index. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 535

x

| Table of Contents

Preface

The first edition of this book was published in 2012, during a time when open source data analysis libraries for Python, especially pandas, were very new and developing rapidly. When the time came to write the second edition in 2016 and 2017, I needed to update the book not only for Python 3.6 (the first edition used Python 2.7) but also for the many changes in pandas that had occurred over the previous five years. Now in 2022, there are fewer Python language changes (we are now at Python 3.10, with 3.11 coming out at the end of 2022), but pandas has continued to evolve. In this third edition, my goal is to bring the content up to date with current versions of Python, NumPy, pandas, and other projects, while also remaining relatively con‐ servative about discussing newer Python projects that have appeared in the last few years. Since this book has become an important resource for many university courses and working professionals, I will try to avoid topics that are at risk of falling out of date within a year or two. That way paper copies won’t be too difficult to follow in 2023 or 2024 or beyond. A new feature of the third edition is the open access online version hosted on my website at https://wesmckinney.com/book, to serve as a resource and convenience for owners of the print and digital editions. I intend to keep the content reasonably up to date there, so if you own the paper book and run into something that doesn’t work properly, you should check there for the latest content changes.

Conventions Used in This Book The following typographical conventions are used in this book: Italic Indicates new terms, URLs, email addresses, filenames, and file extensions.

xi

Constant width

Used for program listings, as well as within paragraphs to refer to program elements such as variable or function names, databases, data types, environment variables, statements, and keywords. Constant width bold

Shows commands or other text that should be typed literally by the user. Constant width italic

Shows text that should be replaced with user-supplied values or by values deter‐ mined by context. This element signifies a tip or suggestion.

This element signifies a general note.

This element indicates a warning or caution.

Using Code Examples You can find data files and related material for each chapter in this book’s GitHub repository at https://github.com/wesm/pydata-book, which is mirrored to Gitee (for those who cannot access GitHub) at https://gitee.com/wesmckinn/pydata-book. This book is here to help you get your job done. In general, if example code is offered with this book, you may use it in your programs and documentation. You do not need to contact us for permission unless you’re reproducing a significant portion of the code. For example, writing a program that uses several chunks of code from this book does not require permission. Selling or distributing examples from O’Reilly books does require permission. Answering a question by citing this book and quoting example code does not require permission. Incorporating a significant amount of example code from this book into your product’s documentation does require permission.

xii

|

Preface

We appreciate, but do not require, attribution. An attribution usually includes the title, author, publisher, and ISBN. For example: “Python for Data Analysis by Wes McKinney (O’Reilly). Copyright 2022 Wes McKinney, 978-1-098-10403-0.” If you feel your use of code examples falls outside fair use or the permission given above, feel free to contact us at [email protected].

O’Reilly Online Learning For more than 40 years, O’Reilly Media has provided technol‐ ogy and business training, knowledge, and insight to help companies succeed. Our unique network of experts and innovators share their knowledge and expertise through books, articles, and our online learning platform. O’Reilly’s online learning platform gives you on-demand access to live training courses, in-depth learning paths, interactive coding environments, and a vast collection of text and video from O’Reilly and 200+ other publishers. For more information, visit http://oreilly.com.

How to Contact Us Please address comments and questions concerning this book to the publisher: O’Reilly Media, Inc. 1005 Gravenstein Highway North Sebastopol, CA 95472 800-998-9938 (in the United States or Canada) 707-829-0515 (international or local) 707-829-0104 (fax) We have a web page for this book, where we list errata, examples, and any additional information. You can access this page at https://oreil.ly/python-data-analysis-3e. Email [email protected] to comment or ask technical questions about this book. For news and information about our books and courses, visit http://oreilly.com. Find us on LinkedIn: https://linkedin.com/company/oreilly-media. Follow us on Twitter: http://twitter.com/oreillymedia. Watch us on YouTube: http://youtube.com/oreillymedia.

Preface

|

xiii

Acknowledgments This work is the product of many years of fruitful discussions and collaborations with, and assistance from many people around the world. I’d like to thank a few of them.

In Memoriam: John D. Hunter (1968–2012) Our dear friend and colleague John D. Hunter passed away after a battle with colon cancer on August 28, 2012. This was only a short time after I’d completed the final manuscript for this book’s first edition. John’s impact and legacy in the Python scientific and data communities would be hard to overstate. In addition to developing matplotlib in the early 2000s (a time when Python was not nearly so popular), he helped shape the culture of a critical generation of open source developers who’ve become pillars of the Python ecosystem that we now often take for granted. I was lucky enough to connect with John early in my open source career in January 2010, just after releasing pandas 0.1. His inspiration and mentorship helped me push forward, even in the darkest of times, with my vision for pandas and Python as a first-class data analysis language. John was very close with Fernando Pérez and Brian Granger, pioneers of IPython, Jupyter, and many other initiatives in the Python community. We had hoped to work on a book together, the four of us, but I ended up being the one with the most free time. I am sure he would be proud of what we’ve accomplished, as individuals and as a community, over the last nine years.

Acknowledgments for the Third Edition (2022) It has more than a decade since I started writing the first edition of this book and more than 15 years since I originally started my journey as a Python prorammer. A lot has changed since then! Python has evolved from a relatively niche language for data analysis to the most popular and most widely used language powering the plurality (if not the majority!) of data science, machine learning, and artificial intelligence work. I have not been an active contributor to the pandas open source project since 2013, but its worldwide developer community has continued to thrive, serving as a model of community-centric open source software development. Many “next-generation” Python projects that deal with tabular data are modeling their user interfaces directly after pandas, so the project has proved to have an enduring influence on the future trajectory of the Python data science ecosystem.

xiv

| Preface

I hope that this book continues to serve as a valuable resource for students and individuals who want to learn about working with data in Python. I’m especially thankful to O’Reilly for allowing me to publish an “open access” version of this book on my website at https://wesmckinney.com/book, where I hope it will reach even more people and help expand opportunity in the world of data analysis. J.J. Allaire was a lifesaver in making this possible by helping me “port” the book from Docbook XML to Quarto, a wonderful new scientific and technical publishing system for print and web. Special thanks to my technical reviewers Paul Barry, Jean-Christophe Leyder, Abdul‐ lah Karasan, and William Jamir, whose thorough feedback has greatly improved the readability, clarity, and understandability of the content.

Acknowledgments for the Second Edition (2017) It has been five years almost to the day since I completed the manuscript for this book’s first edition in July 2012. A lot has changed. The Python community has grown immensely, and the ecosystem of open source software around it has flourished. This new edition of the book would not exist if not for the tireless efforts of the pandas core developers, who have grown the project and its user community into one of the cornerstones of the Python data science ecosystem. These include, but are not limited to, Tom Augspurger, Joris van den Bossche, Chris Bartak, Phillip Cloud, gfyoung, Andy Hayden, Masaaki Horikoshi, Stephan Hoyer, Adam Klein, Wouter Overmeire, Jeff Reback, Chang She, Skipper Seabold, Jeff Tratner, and y-p. On the actual writing of this second edition, I would like to thank the O’Reilly staff who helped me patiently with the writing process. This includes Marie Beaugureau, Ben Lorica, and Colleen Toporek. I again had outstanding technical reviewers with Tom Augspurger, Paul Barry, Hugh Brown, Jonathan Coe, and Andreas Müller con‐ tributing. Thank you. This book’s first edition has been translated into many foreign languages, including Chinese, French, German, Japanese, Korean, and Russian. Translating all this content and making it available to a broader audience is a huge and often thankless effort. Thank you for helping more people in the world learn how to program and use data analysis tools. I am also lucky to have had support for my continued open source development efforts from Cloudera and Two Sigma Investments over the last few years. With open source software projects more thinly resourced than ever relative to the size of user bases, it is becoming increasingly important for businesses to provide support for development of key open source projects. It’s the right thing to do.

Preface

|

xv

Acknowledgments for the First Edition (2012) It would have been difficult for me to write this book without the support of a large number of people. On the O’Reilly staff, I’m very grateful for my editors, Meghan Blanchette and Julie Steele, who guided me through the process. Mike Loukides also worked with me in the proposal stages and helped make the book a reality. I received a wealth of technical review from a large cast of characters. In particu‐ lar, Martin Blais and Hugh Brown were incredibly helpful in improving the book’s examples, clarity, and organization from cover to cover. James Long, Drew Conway, Fernando Pérez, Brian Granger, Thomas Kluyver, Adam Klein, Josh Klein, Chang She, and Stéfan van der Walt each reviewed one or more chapters, providing pointed feedback from many different perspectives. I got many great ideas for examples and datasets from friends and colleagues in the data community, among them: Mike Dewar, Jeff Hammerbacher, James Johndrow, Kristian Lum, Adam Klein, Hilary Mason, Chang She, and Ashley Williams. I am of course indebted to the many leaders in the open source scientific Python community who’ve built the foundation for my development work and gave encour‐ agement while I was writing this book: the IPython core team (Fernando Pérez, Brian Granger, Min Ragan-Kelly, Thomas Kluyver, and others), John Hunter, Skipper Seabold, Travis Oliphant, Peter Wang, Eric Jones, Robert Kern, Josef Perktold, Fran‐ cesc Alted, Chris Fonnesbeck, and too many others to mention. Several other people provided a great deal of support, ideas, and encouragement along the way: Drew Conway, Sean Taylor, Giuseppe Paleologo, Jared Lander, David Epstein, John Krowas, Joshua Bloom, Den Pilsworth, John Myles-White, and many others I’ve forgotten. I’d also like to thank a number of people from my formative years. First, my former AQR colleagues who’ve cheered me on in my pandas work over the years: Alex Reyf‐ man, Michael Wong, Tim Sargen, Oktay Kurbanov, Matthew Tschantz, Roni Israelov, Michael Katz, Ari Levine, Chris Uga, Prasad Ramanan, Ted Square, and Hoon Kim. Lastly, my academic advisors Haynes Miller (MIT) and Mike West (Duke). I received significant help from Phillip Cloud and Joris van den Bossche in 2014 to update the book’s code examples and fix some other inaccuracies due to changes in pandas. On the personal side, Casey provided invaluable day-to-day support during the writing process, tolerating my highs and lows as I hacked together the final draft on top of an already overcommitted schedule. Lastly, my parents, Bill and Kim, taught me to always follow my dreams and to never settle for less.

xvi

|

Preface

CHAPTER 1

Preliminaries

1.1 What Is This Book About? This book is concerned with the nuts and bolts of manipulating, processing, cleaning, and crunching data in Python. My goal is to offer a guide to the parts of the Python programming language and its data-oriented library ecosystem and tools that will equip you to become an effective data analyst. While “data analysis” is in the title of the book, the focus is specifically on Python programming, libraries, and tools as opposed to data analysis methodology. This is the Python programming you need for data analysis. Sometime after I originally published this book in 2012, people started using the term data science as an umbrella description for everything from simple descriptive statistics to more advanced statistical analysis and machine learning. The Python open source ecosystem for doing data analysis (or data science) has also expanded significantly since then. There are now many other books which focus specifically on these more advanced methodologies. My hope is that this book serves as adequate preparation to enable you to move on to a more domain-specific resource. Some might characterize much of the content of the book as “data manipulation” as opposed to “data analysis.” We also use the terms wrangling or munging to refer to data manipulation.

What Kinds of Data? When I say “data,” what am I referring to exactly? The primary focus is on structured data, a deliberately vague term that encompasses many different common forms of data, such as: 1

• Tabular or spreadsheet-like data in which each column may be a different type (string, numeric, date, or otherwise). This includes most kinds of data commonly stored in relational databases or tab- or comma-delimited text files. • Multidimensional arrays (matrices). • Multiple tables of data interrelated by key columns (what would be primary or foreign keys for a SQL user). • Evenly or unevenly spaced time series. This is by no means a complete list. Even though it may not always be obvious, a large percentage of datasets can be transformed into a structured form that is more suitable for analysis and modeling. If not, it may be possible to extract features from a dataset into a structured form. As an example, a collection of news articles could be processed into a word frequency table, which could then be used to perform sentiment analysis. Most users of spreadsheet programs like Microsoft Excel, perhaps the most widely used data analysis tool in the world, will not be strangers to these kinds of data.

1.2 Why Python for Data Analysis? For many people, the Python programming language has strong appeal. Since its first appearance in 1991, Python has become one of the most popular interpreted programming languages, along with Perl, Ruby, and others. Python and Ruby have become especially popular since 2005 or so for building websites using their numer‐ ous web frameworks, like Rails (Ruby) and Django (Python). Such languages are often called scripting languages, as they can be used to quickly write small programs, or scripts to automate other tasks. I don’t like the term “scripting languages,” as it carries a connotation that they cannot be used for building serious software. Among interpreted languages, for various historical and cultural reasons, Python has devel‐ oped a large and active scientific computing and data analysis community. In the last 20 years, Python has gone from a bleeding-edge or “at your own risk” scientific com‐ puting language to one of the most important languages for data science, machine learning, and general software development in academia and industry. For data analysis and interactive computing and data visualization, Python will inevi‐ tably draw comparisons with other open source and commercial programming lan‐ guages and tools in wide use, such as R, MATLAB, SAS, Stata, and others. In recent years, Python’s improved open source libraries (such as pandas and scikit-learn) have made it a popular choice for data analysis tasks. Combined with Python’s overall strength for general-purpose software engineering, it is an excellent option as a primary language for building data applications.

2

|

Chapter 1: Preliminaries

Python as Glue Part of Python’s success in scientific computing is the ease of integrating C, C++, and FORTRAN code. Most modern computing environments share a similar set of legacy FORTRAN and C libraries for doing linear algebra, optimization, integration, fast Fourier transforms, and other such algorithms. The same story has held true for many companies and national labs that have used Python to glue together decades’ worth of legacy software. Many programs consist of small portions of code where most of the time is spent, with large amounts of “glue code” that doesn’t run often. In many cases, the execution time of the glue code is insignificant; effort is most fruitfully invested in optimizing the computational bottlenecks, sometimes by moving the code to a lower-level lan‐ guage like C.

Solving the “Two-Language” Problem In many organizations, it is common to research, prototype, and test new ideas using a more specialized computing language like SAS or R and then later port those ideas to be part of a larger production system written in, say, Java, C#, or C++. What people are increasingly finding is that Python is a suitable language not only for doing research and prototyping but also for building the production systems. Why maintain two development environments when one will suffice? I believe that more and more companies will go down this path, as there are often significant organizational benefits to having both researchers and software engineers using the same set of programming tools. Over the last decade some new approaches to solving the “two-language” problem have appeared, such as the Julia programming language. Getting the most out of Python in many cases will require programming in a low-level language like C or C++ and creating Python bindings to that code. That said, “just-in-time” (JIT) com‐ piler technology provided by libraries like Numba have provided a way to achieve excellent performance in many computational algorithms without having to leave the Python programming environment.

Why Not Python? While Python is an excellent environment for building many kinds of analytical applications and general-purpose systems, there are a number of uses for which Python may be less suitable. As Python is an interpreted programming language, in general most Python code will run substantially slower than code written in a compiled language like Java or C++. As programmer time is often more valuable than CPU time, many are happy to make this trade-off. However, in an application with very low latency or demanding 1.2 Why Python for Data Analysis?

|

3

resource utilization requirements (e.g., a high-frequency trading system), the time spent programming in a lower-level (but also lower-productivity) language like C++ to achieve the maximum possible performance might be time well spent. Python can be a challenging language for building highly concurrent, multithreaded applications, particularly applications with many CPU-bound threads. The reason for this is that it has what is known as the global interpreter lock (GIL), a mechanism that prevents the interpreter from executing more than one Python instruction at a time. The technical reasons for why the GIL exists are beyond the scope of this book. While it is true that in many big data processing applications, a cluster of computers may be required to process a dataset in a reasonable amount of time, there are still situations where a single-process, multithreaded system is desirable. This is not to say that Python cannot execute truly multithreaded, parallel code. Python C extensions that use native multithreading (in C or C++) can run code in parallel without being impacted by the GIL, as long as they do not need to regularly interact with Python objects.

1.3 Essential Python Libraries For those who are less familiar with the Python data ecosystem and the libraries used throughout the book, I will give a brief overview of some of them.

NumPy NumPy, short for Numerical Python, has long been a cornerstone of numerical computing in Python. It provides the data structures, algorithms, and library glue needed for most scientific applications involving numerical data in Python. NumPy contains, among other things: • A fast and efficient multidimensional array object ndarray • Functions for performing element-wise computations with arrays or mathemati‐ cal operations between arrays • Tools for reading and writing array-based datasets to disk • Linear algebra operations, Fourier transform, and random number generation • A mature C API to enable Python extensions and native C or C++ code to access NumPy’s data structures and computational facilities Beyond the fast array-processing capabilities that NumPy adds to Python, one of its primary uses in data analysis is as a container for data to be passed between algorithms and libraries. For numerical data, NumPy arrays are more efficient for storing and manipulating data than the other built-in Python data structures. Also, libraries written in a lower-level language, such as C or FORTRAN, can operate on 4

| Chapter 1: Preliminaries

the data stored in a NumPy array without copying data into some other memory representation. Thus, many numerical computing tools for Python either assume NumPy arrays as a primary data structure or else target interoperability with NumPy.

pandas pandas provides high-level data structures and functions designed to make working with structured or tabular data intuitive and flexible. Since its emergence in 2010, it has helped enable Python to be a powerful and productive data analysis environment. The primary objects in pandas that will be used in this book are the DataFrame, a tabular, column-oriented data structure with both row and column labels, and the Series, a one-dimensional labeled array object. pandas blends the array-computing ideas of NumPy with the kinds of data manipu‐ lation capabilities found in spreadsheets and relational databases (such as SQL). It provides convenient indexing functionality to enable you to reshape, slice and dice, perform aggregations, and select subsets of data. Since data manipulation, prepara‐ tion, and cleaning are such important skills in data analysis, pandas is one of the primary focuses of this book. As a bit of background, I started building pandas in early 2008 during my tenure at AQR Capital Management, a quantitative investment management firm. At the time, I had a distinct set of requirements that were not well addressed by any single tool at my disposal: • Data structures with labeled axes supporting automatic or explicit data align‐ ment—this prevents common errors resulting from misaligned data and working with differently indexed data coming from different sources • Integrated time series functionality • The same data structures handle both time series data and non-time series data • Arithmetic operations and reductions that preserve metadata • Flexible handling of missing data • Merge and other relational operations found in popular databases (SQL-based, for example) I wanted to be able to do all of these things in one place, preferably in a language well suited to general-purpose software development. Python was a good candidate language for this, but at that time an integrated set of data structures and tools providing this functionality did not exist. As a result of having been built initially to solve finance and business analytics problems, pandas features especially deep time series functionality and tools well suited for working with time-indexed data generated by business processes.

1.3 Essential Python Libraries

|

5

I spent a large part of 2011 and 2012 expanding pandas’s capabilities with some of my former AQR colleagues, Adam Klein and Chang She. In 2013, I stopped being as involved in day-to-day project development, and pandas has since become a fully community-owned and community-maintained project with well over two thousand unique contributors around the world. For users of the R language for statistical computing, the DataFrame name will be familiar, as the object was named after the similar R data.frame object. Unlike Python, data frames are built into the R programming language and its standard library. As a result, many features found in pandas are typically either part of the R core implementation or provided by add-on packages. The pandas name itself is derived from panel data, an econometrics term for multidi‐ mensional structured datasets, and a play on the phrase Python data analysis.

matplotlib matplotlib is the most popular Python library for producing plots and other twodimensional data visualizations. It was originally created by John D. Hunter and is now maintained by a large team of developers. It is designed for creating plots suitable for publication. While there are other visualization libraries available to Python programmers, matplotlib is still widely used and integrates reasonably well with the rest of the ecosystem. I think it is a safe choice as a default visualization tool.

IPython and Jupyter The IPython project began in 2001 as Fernando Pérez’s side project to make a better interactive Python interpreter. Over the subsequent 20 years it has become one of the most important tools in the modern Python data stack. While it does not provide any computational or data analytical tools by itself, IPython is designed for both interactive computing and software development work. It encourages an execute-explore workflow instead of the typical edit-compile-run workflow of many other programming languages. It also provides integrated access to your operating system’s shell and filesystem; this reduces the need to switch between a terminal window and a Python session in many cases. Since much of data analysis coding involves exploration, trial and error, and iteration, IPython can help you get the job done faster. In 2014, Fernando and the IPython team announced the Jupyter project, a broader initiative to design language-agnostic interactive computing tools. The IPython web notebook became the Jupyter notebook, with support now for over 40 programming languages. The IPython system can now be used as a kernel (a programming language mode) for using Python with Jupyter.

6

| Chapter 1: Preliminaries

IPython itself has become a component of the much broader Jupyter open source project, which provides a productive environment for interactive and exploratory computing. Its oldest and simplest “mode” is as an enhanced Python shell designed to accelerate the writing, testing, and debugging of Python code. You can also use the IPython system through the Jupyter notebook. The Jupyter notebook system also allows you to author content in Markdown and HTML, providing you a means to create rich documents with code and text. I personally use IPython and Jupyter regularly in my Python work, whether running, debugging, or testing code. In the accompanying book materials on GitHub, you will find Jupyter notebooks containing all the code examples from each chapter. If you cannot access GitHub where you are, you can try the mirror on Gitee.

SciPy SciPy is a collection of packages addressing a number of foundational problems in scientific computing. Here are some of the tools it contains in its various modules: scipy.integrate

Numerical integration routines and differential equation solvers scipy.linalg

Linear algebra routines and matrix decompositions extending beyond those pro‐ vided in numpy.linalg scipy.optimize

Function optimizers (minimizers) and root finding algorithms scipy.signal

Signal processing tools scipy.sparse

Sparse matrices and sparse linear system solvers scipy.special

Wrapper around SPECFUN, a FORTRAN library implementing many common mathematical functions, such as the gamma function scipy.stats

Standard continuous and discrete probability distributions (density functions, samplers, continuous distribution functions), various statistical tests, and more descriptive statistics

1.3 Essential Python Libraries

|

7

Together, NumPy and SciPy form a reasonably complete and mature computational foundation for many traditional scientific computing applications.

scikit-learn Since the project’s inception in 2007, scikit-learn has become the premier generalpurpose machine learning toolkit for Python programmers. As of this writing, more than two thousand different individuals have contributed code to the project. It includes submodules for such models as: • Classification: SVM, nearest neighbors, random forest, logistic regression, etc. • Regression: Lasso, ridge regression, etc. • Clustering: k-means, spectral clustering, etc. • Dimensionality reduction: PCA, feature selection, matrix factorization, etc. • Model selection: Grid search, cross-validation, metrics • Preprocessing: Feature extraction, normalization Along with pandas, statsmodels, and IPython, scikit-learn has been critical for ena‐ bling Python to be a productive data science programming language. While I won’t be able to include a comprehensive guide to scikit-learn in this book, I will give a brief introduction to some of its models and how to use them with the other tools presented in the book.

statsmodels statsmodels is a statistical analysis package that was seeded by work from Stanford University statistics professor Jonathan Taylor, who implemented a number of regres‐ sion analysis models popular in the R programming language. Skipper Seabold and Josef Perktold formally created the new statsmodels project in 2010 and since then have grown the project to a critical mass of engaged users and contributors. Nathaniel Smith developed the Patsy project, which provides a formula or model specification framework for statsmodels inspired by R’s formula system. Compared with scikit-learn, statsmodels contains algorithms for classical (primarily frequentist) statistics and econometrics. This includes such submodules as: • Regression models: linear regression, generalized linear models, robust linear models, linear mixed effects models, etc. • Analysis of variance (ANOVA) • Time series analysis: AR, ARMA, ARIMA, VAR, and other models • Nonparametric methods: Kernel density estimation, kernel regression

8

|

Chapter 1: Preliminaries

• Visualization of statistical model results statsmodels is more focused on statistical inference, providing uncertainty estimates and p-values for parameters. scikit-learn, by contrast, is more prediction focused. As with scikit-learn, I will give a brief introduction to statsmodels and how to use it with NumPy and pandas.

Other Packages In 2022, there are many other Python libraries which might be discussed in a book about data science. This includes some newer projects like TensorFlow or PyTorch, which have become popular for machine learning or artificial intelligence work. Now that there are other books out there that focus more specifically on those projects, I would recommend using this book to build a foundation in general-purpose Python data wrangling. Then, you should be well prepared to move on to a more advanced resource that may assume a certain level of expertise.

1.4 Installation and Setup Since everyone uses Python for different applications, there is no single solution for setting up Python and obtaining the necessary add-on packages. Many readers will not have a complete Python development environment suitable for following along with this book, so here I will give detailed instructions to get set up on each operating system. I will be using Miniconda, a minimal installation of the conda package manager, along with conda-forge, a community-maintained software distribution based on conda. This book uses Python 3.10 throughout, but if you’re reading in the future, you are welcome to install a newer version of Python. If for some reason these instructions become out-of-date by the time you are reading this, you can check out my website for the book which I will endeavor to keep up to date with the latest installation instructions.

Miniconda on Windows To get started on Windows, download the Miniconda installer for the latest Python version available (currently 3.9) from https://conda.io. I recommend following the installation instructions for Windows available on the conda website, which may have changed between the time this book was published and when you are reading this. Most people will want the 64-bit version, but if that doesn’t run on your Windows machine, you can install the 32-bit version instead. When prompted whether to install for just yourself or for all users on your system, choose the option that’s most appropriate for you. Installing just for yourself will be sufficient to follow along with the book. It will also ask you whether you want to 1.4 Installation and Setup

|

9

add Miniconda to the system PATH environment variable. If you select this (I usually do), then this Miniconda installation may override other versions of Python you have installed. If you do not, then you will need to use the Window Start menu shortcut that’s installed to be able to use this Miniconda. This Start menu entry may be called “Anaconda3 (64-bit).” I’ll assume that you haven’t added Miniconda to your system PATH. To verify that things are configured correctly, open the “Anaconda Prompt (Miniconda3)” entry under “Anaconda3 (64-bit)” in the Start menu. Then try launching the Python inter‐ preter by typing python. You should see a message like this: (base) C:\Users\Wes>python Python 3.9 [MSC v.1916 64 bit (AMD64)] :: Anaconda, Inc. on win32 Type "help", "copyright", "credits" or "license" for more information. >>>

To exit the Python shell, type exit() and press Enter.

GNU/Linux Linux details will vary a bit depending on your Linux distribution type, but here I give details for such distributions as Debian, Ubuntu, CentOS, and Fedora. Setup is similar to macOS with the exception of how Miniconda is installed. Most readers will want to download the default 64-bit installer file, which is for x86 architecture (but it’s possible in the future more users will have aarch64-based Linux machines). The installer is a shell script that must be executed in the terminal. You will then have a file named something similar to Miniconda3-latest-Linux-x86_64.sh. To install it, execute this script with bash: $ bash Miniconda3-latest-Linux-x86_64.sh

Some Linux distributions have all the required Python packages (although outdated versions, in some cases) in their package man‐ agers and can be installed using a tool like apt. The setup described here uses Miniconda, as it’s both easily reproducible across distri‐ butions and simpler to upgrade packages to their latest versions.

You will have a choice of where to put the Miniconda files. I recommend installing the files in the default location in your home directory; for example, /home/$USER/ miniconda (with your username, naturally). The installer will ask if you wish to modify your shell scripts to automatically activate Miniconda. I recommend doing this (select “yes”) as a matter of convenience. After completing the installation, start a new terminal process and verify that you are picking up the new Miniconda installation:

10

| Chapter 1: Preliminaries

(base) $ python Python 3.9 | (main) [GCC 10.3.0] on linux Type "help", "copyright", "credits" or "license" for more information. >>>

To exit the Python shell, type exit() and press Enter or press Ctrl-D.

Miniconda on macOS Download the macOS Miniconda installer, which should be named something like Miniconda3-latest-MacOSX-arm64.sh for Apple Silicon-based macOS computers released from 2020 onward, or Miniconda3-latest-MacOSX-x86_64.sh for Intel-based Macs released before 2020. Open the Terminal application in macOS, and install by executing the installer (most likely in your Downloads directory) with bash: $ bash $HOME/Downloads/Miniconda3-latest-MacOSX-arm64.sh

When the installer runs, by default it automatically configures Miniconda in your default shell environment in your default shell profile. This is probably located at /Users/$USER/.zshrc. I recommend letting it do this; if you do not want to allow the installer to modify your default shell environment, you will need to consult the Miniconda documentation to be able to proceed. To verify everything is working, try launching Python in the system shell (open the Terminal application to get a command prompt): $ python Python 3.9 (main) [Clang 12.0.1 ] on darwin Type "help", "copyright", "credits" or "license" for more information. >>>

To exit the shell, press Ctrl-D or type exit() and press Enter.

Installing Necessary Packages Now that we have set up Miniconda on your system, it’s time to install the main packages we will be using in this book. The first step is to configure conda-forge as your default package channel by running the following commands in a shell: (base) $ conda config --add channels conda-forge (base) $ conda config --set channel_priority strict

Now, we will create a new conda “environment” with the conda create command using Python 3.10: (base) $ conda create -y -n pydata-book python=3.10

After the installation completes, activate the environment with conda activate: (base) $ conda activate pydata-book (pydata-book) $

1.4 Installation and Setup

|

11

It is necessary to use conda activate to activate your environment each time you open a new terminal. You can see information about the active conda environment at any time from the terminal by running conda info.

Now, we will install the essential packages used throughout the book (along with their dependencies) with conda install: (pydata-book) $ conda install -y pandas jupyter matplotlib

We will be using some other packages, too, but these can be installed later once they are needed. There are two ways to install packages: with conda install and with pip install. conda install should always be preferred when using Miniconda, but some packages are not available through conda, so if conda install $package_name fails, try pip install $package_name. If you want to install all of the packages used in the rest of the book, you can do that now by running: conda install lxml beautifulsoup4 html5lib openpyxl \ requests sqlalchemy seaborn scipy statsmodels \ patsy scikit-learn pyarrow pytables numba

On Windows, substitute a carat ^ for the line continuation \ used on Linux and macOS.

You can update packages by using the conda update command: conda update package_name

pip also supports upgrades using the --upgrade flag: pip install --upgrade package_name

You will have several opportunities to try out these commands throughout the book. While you can use both conda and pip to install packages, you should avoid updating packages originally installed with conda using pip (and vice versa), as doing so can lead to environment problems. I recommend sticking to conda if you can and falling back on pip only for packages that are unavailable with conda install.

Integrated Development Environments and Text Editors When asked about my standard development environment, I almost always say “IPy‐ thon plus a text editor.” I typically write a program and iteratively test and debug each piece of it in IPython or Jupyter notebooks. It is also useful to be able to play around 12

|

Chapter 1: Preliminaries

with data interactively and visually verify that a particular set of data manipulations is doing the right thing. Libraries like pandas and NumPy are designed to be productive to use in the shell. When building software, however, some users may prefer to use a more richly featured integrated development environment (IDE) and rather than an editor like Emacs or Vim which provide a more minimal environment out of the box. Here are some that you can explore: • PyDev (free), an IDE built on the Eclipse platform • PyCharm from JetBrains (subscription-based for commercial users, free for open source developers) • Python Tools for Visual Studio (for Windows users) • Spyder (free), an IDE currently shipped with Anaconda • Komodo IDE (commercial) Due to the popularity of Python, most text editors, like VS Code and Sublime Text 2, have excellent Python support.

1.5 Community and Conferences Outside of an internet search, the various scientific and data-related Python mailing lists are generally helpful and responsive to questions. Some to take a look at include: • pydata: A Google Group list for questions related to Python for data analysis and pandas • pystatsmodels: For statsmodels or pandas-related questions • Mailing list for scikit-learn ([email protected]) and machine learning in Python, generally • numpy-discussion: For NumPy-related questions • scipy-user: For general SciPy or scientific Python questions I deliberately did not post URLs for these in case they change. They can be easily located via an internet search. Each year many conferences are held all over the world for Python programmers. If you would like to connect with other Python programmers who share your inter‐ ests, I encourage you to explore attending one, if possible. Many conferences have financial support available for those who cannot afford admission or travel to the conference. Here are some to consider:

1.5 Community and Conferences

|

13

• PyCon and EuroPython: The two main general Python conferences in North America and Europe, respectively • SciPy and EuroSciPy: Scientific-computing-oriented conferences in North Amer‐ ica and Europe, respectively • PyData: A worldwide series of regional conferences targeted at data science and data analysis use cases • International and regional PyCon conferences (see https://pycon.org for a com‐ plete listing)

1.6 Navigating This Book If you have never programmed in Python before, you will want to spend some time in Chapters 2 and 3, where I have placed a condensed tutorial on Python language features and the IPython shell and Jupyter notebooks. These things are prerequisite knowledge for the remainder of the book. If you have Python experience already, you may instead choose to skim or skip these chapters. Next, I give a short introduction to the key features of NumPy, leaving more advanced NumPy use for Appendix A. Then, I introduce pandas and devote the rest of the book to data analysis topics applying pandas, NumPy, and matplotlib (for visualization). I have structured the material in an incremental fashion, though there is occasionally some minor crossover between chapters, with a few cases where concepts are used that haven’t been introduced yet. While readers may have many different end goals for their work, the tasks required generally fall into a number of different broad groups: Interacting with the outside world Reading and writing with a variety of file formats and data stores Preparation Cleaning, munging, combining, normalizing, reshaping, slicing and dicing, and transforming data for analysis Transformation Applying mathematical and statistical operations to groups of datasets to derive new datasets (e.g., aggregating a large table by group variables) Modeling and computation Connecting your data to statistical models, machine learning algorithms, or other computational tools Presentation Creating interactive or static graphical visualizations or textual summaries

14

|

Chapter 1: Preliminaries

Code Examples Most of the code examples in the book are shown with input and output as it would appear executed in the IPython shell or in Jupyter notebooks: In [5]: CODE EXAMPLE Out[5]: OUTPUT

When you see a code example like this, the intent is for you to type the example code in the In block in your coding environment and execute it by pressing the Enter key (or Shift-Enter in Jupyter). You should see output similar to what is shown in the Out block. I changed the default console output settings in NumPy and pandas to improve readability and brevity throughout the book. For example, you may see more digits of precision printed in numeric data. To exactly match the output shown in the book, you can execute the following Python code before running the code examples: import numpy as np import pandas as pd pd.options.display.max_columns = 20 pd.options.display.max_rows = 20 pd.options.display.max_colwidth = 80 np.set_printoptions(precision=4, suppress=True)

Data for Examples Datasets for the examples in each chapter are hosted in a GitHub repository (or in a mirror on Gitee if you cannot access GitHub). You can download this data either by using the Git version control system on the command line or by downloading a zip file of the repository from the website. If you run into problems, navigate to the book website for up-to-date instructions about obtaining the book materials. If you download a zip file containing the example datasets, you must then fully extract the contents of the zip file to a directory and navigate to that directory from the terminal before proceeding with running the book’s code examples: $ pwd /home/wesm/book-materials $ ls appa.ipynb ch05.ipynb ch09.ipynb ch13.ipynb ch02.ipynb ch06.ipynb ch10.ipynb COPYING ch03.ipynb ch07.ipynb ch11.ipynb datasets ch04.ipynb ch08.ipynb ch12.ipynb examples

README.md requirements.txt

1.6 Navigating This Book

|

15

I have made every effort to ensure that the GitHub repository contains everything necessary to reproduce the examples, but I may have made some mistakes or omis‐ sions. If so, please send me an email: [email protected]. The best way to report errors in the book is on the errata page on the O’Reilly website.

Import Conventions The Python community has adopted a number of naming conventions for commonly used modules: import import import import import

numpy as np matplotlib.pyplot as plt pandas as pd seaborn as sns statsmodels as sm

This means that when you see np.arange, this is a reference to the arange function in NumPy. This is done because it’s considered bad practice in Python software development to import everything (from numpy import *) from a large package like NumPy.

16

|

Chapter 1: Preliminaries

CHAPTER 2

Python Language Basics, IPython, and Jupyter Notebooks

When I wrote the first edition of this book in 2011 and 2012, there were fewer resources available for learning about doing data analysis in Python. This was partially a chicken-and-egg problem; many libraries that we now take for granted, like pandas, scikit-learn, and statsmodels, were comparatively immature back then. Now in 2022, there is now a growing literature on data science, data analysis, and machine learning, supplementing the prior works on general-purpose scientific com‐ puting geared toward computational scientists, physicists, and professionals in other research fields. There are also excellent books about learning the Python program‐ ming language itself and becoming an effective software engineer. As this book is intended as an introductory text in working with data in Python, I feel it is valuable to have a self-contained overview of some of the most important features of Python’s built-in data structures and libraries from the perspective of data manipulation. So, I will only present roughly enough information in this chapter and Chapter 3 to enable you to follow along with the rest of the book. Much of this book focuses on table-based analytics and data preparation tools for working with datasets that are small enough to fit on your personal computer. To use these tools you must sometimes do some wrangling to arrange messy data into a more nicely tabular (or structured) form. Fortunately, Python is an ideal language for doing this. The greater your facility with the Python language and its built-in data types, the easier it will be for you to prepare new datasets for analysis. Some of the tools in this book are best explored from a live IPython or Jupyter session. Once you learn how to start up IPython and Jupyter, I recommend that you follow along with the examples so you can experiment and try different things. As

17

with any keyboard-driven console-like environment, developing familiarity with the common commands is also part of the learning curve. There are introductory Python concepts that this chapter does not cover, like classes and object-oriented programming, which you may find useful in your foray into data analysis in Python. To deepen your Python language knowledge, I recommend that you supplement this chapter with the official Python tutorial and potentially one of the many excellent books on general-purpose Python programming. Some recommendations to get you started include: • Python Cookbook, Third Edition, by David Beazley and Brian K. Jones (O’Reilly) • Fluent Python by Luciano Ramalho (O’Reilly) • Effective Python, Second Edition, by Brett Slatkin (AddisonWesley)

2.1 The Python Interpreter Python is an interpreted language. The Python interpreter runs a program by execut‐ ing one statement at a time. The standard interactive Python interpreter can be invoked on the command line with the python command: $ python Python 3.10.4 | packaged by conda-forge | (main, Mar 24 2022, 17:38:57) [GCC 10.3.0] on linux Type "help", "copyright", "credits" or "license" for more information. >>> a = 5 >>> print(a) 5

The >>> you see is the prompt after which you’ll type code expressions. To exit the Python interpreter, you can either type exit() or press Ctrl-D (works on Linux and macOS only). Running Python programs is as simple as calling python with a .py file as its first argument. Suppose we had created hello_world.py with these contents: print("Hello world")

You can run it by executing the following command (the hello_world.py file must be in your current working terminal directory): $ python hello_world.py Hello world

18

|

Chapter 2: Python Language Basics, IPython, and Jupyter Notebooks

While some Python programmers execute all of their Python code in this way, those doing data analysis or scientific computing make use of IPython, an enhanced Python interpreter, or Jupyter notebooks, web-based code notebooks originally cre‐ ated within the IPython project. I give an introduction to using IPython and Jupyter in this chapter and have included a deeper look at IPython functionality in Appen‐ dix A. When you use the %run command, IPython executes the code in the specified file in the same process, enabling you to explore the results interactively when it’s done: $ ipython Python 3.10.4 | packaged by conda-forge | (main, Mar 24 2022, 17:38:57) Type 'copyright', 'credits' or 'license' for more information IPython 7.31.1 -- An enhanced Interactive Python. Type '?' for help. In [1]: %run hello_world.py Hello world In [2]:

The default IPython prompt adopts the numbered In [2]: style, compared with the standard >>> prompt.

2.2 IPython Basics In this section, I’ll get you up and running with the IPython shell and Jupyter notebook, and introduce you to some of the essential concepts.

Running the IPython Shell You can launch the IPython shell on the command line just like launching the regular Python interpreter except with the ipython command: $ ipython Python 3.10.4 | packaged by conda-forge | (main, Mar 24 2022, 17:38:57) Type 'copyright', 'credits' or 'license' for more information IPython 7.31.1 -- An enhanced Interactive Python. Type '?' for help. In [1]: a = 5 In [2]: a Out[2]: 5

You can execute arbitrary Python statements by typing them and pressing Return (or Enter). When you type just a variable into IPython, it renders a string representation of the object: In [5]: import numpy as np In [6]: data = [np.random.standard_normal() for i in range(7)]

2.2 IPython Basics

|

19

In [7]: data Out[7]: [-0.20470765948471295, 0.47894333805754824, -0.5194387150567381, -0.55573030434749, 1.9657805725027142, 1.3934058329729904, 0.09290787674371767]

The first two lines are Python code statements; the second statement creates a vari‐ able named data that refers to a newly created Python dictionary. The last line prints the value of data in the console. Many kinds of Python objects are formatted to be more readable, or pretty-printed, which is distinct from normal printing with print. If you printed the above data variable in the standard Python interpreter, it would be much less readable: >>> import numpy as np >>> data = [np.random.standard_normal() for i in range(7)] >>> print(data) >>> data [-0.5767699931966723, -0.1010317773535111, -1.7841005313329152, -1.524392126408841, 0.22191374220117385, -1.9835710588082562, -1.6081963964963528]

IPython also provides facilities to execute arbitrary blocks of code (via a somewhat glorified copy-and-paste approach) and whole Python scripts. You can also use the Jupyter notebook to work with larger blocks of code, as we will soon see.

Running the Jupyter Notebook One of the major components of the Jupyter project is the notebook, a type of interactive document for code, text (including Markdown), data visualizations, and other output. The Jupyter notebook interacts with kernels, which are implementations of the Jupyter interactive computing protocol specific to different programming languages. The Python Jupyter kernel uses the IPython system for its underlying behavior. To start up Jupyter, run the command jupyter notebook in a terminal: $ jupyter notebook [I 15:20:52.739 NotebookApp] Serving notebooks from local directory: /home/wesm/code/pydata-book [I 15:20:52.739 NotebookApp] 0 active kernels [I 15:20:52.739 NotebookApp] The Jupyter Notebook is running at: http://localhost:8888/?token=0a77b52fefe52ab83e3c35dff8de121e4bb443a63f2d... [I 15:20:52.740 NotebookApp] Use Control-C to stop this server and shut down all kernels (twice to skip confirmation). Created new window in existing browser session.

20

|

Chapter 2: Python Language Basics, IPython, and Jupyter Notebooks

To access the notebook, open this file in a browser: file:///home/wesm/.local/share/jupyter/runtime/nbserver-185259-open.html Or copy and paste one of these URLs: http://localhost:8888/?token=0a77b52fefe52ab83e3c35dff8de121e4... or http://127.0.0.1:8888/?token=0a77b52fefe52ab83e3c35dff8de121e4...

On many platforms, Jupyter will automatically open in your default web browser (unless you start it with --no-browser). Otherwise, you can navigate to the HTTP address printed when you started the notebook, here http://localhost:8888/? token=0a77b52fefe52ab83e3c35dff8de121e4bb443a63f2d3055. See Figure 2-1 for what this looks like in Google Chrome. Many people use Jupyter as a local computing environment, but it can also be deployed on servers and accessed remotely. I won’t cover those details here, but I encourage you to explore this topic on the internet if it’s relevant to your needs.

Figure 2-1. Jupyter notebook landing page 2.2 IPython Basics

|

21

To create a new notebook, click the New button and select the “Python 3” option. You should see something like Figure 2-2. If this is your first time, try clicking on the empty code “cell” and entering a line of Python code. Then press Shift-Enter to execute it.

Figure 2-2. Jupyter new notebook view When you save the notebook (see “Save and Checkpoint” under the notebook File menu), it creates a file with the extension .ipynb. This is a self-contained file format that contains all of the content (including any evaluated code output) currently in the notebook. These can be loaded and edited by other Jupyter users. To rename an open notebook, click on the notebook title at the top of the page and type the new title, pressing Enter when you are finished. To load an existing notebook, put the file in the same directory where you started the notebook process (or in a subfolder within it), then click the name from the landing page. You can try it out with the notebooks from my wesm/pydata-book repository on GitHub. See Figure 2-3. When you want to close a notebook, click the File menu and select “Close and Halt.” If you simply close the browser tab, the Python process associated with the notebook will keep running in the background. While the Jupyter notebook may feel like a distinct experience from the IPython shell, nearly all of the commands and tools in this chapter can be used in either environment.

22

|

Chapter 2: Python Language Basics, IPython, and Jupyter Notebooks

Figure 2-3. Jupyter example view for an existing notebook

Tab Completion On the surface, the IPython shell looks like a cosmetically different version of the standard terminal Python interpreter (invoked with python). One of the major improvements over the standard Python shell is tab completion, found in many IDEs or other interactive computing analysis environments. While entering expressions in the shell, pressing the Tab key will search the namespace for any variables (objects, functions, etc.) matching the characters you have typed so far and show the results in a convenient drop-down menu: In [1]: an_apple = 27 In [2]: an_example = 42 In [3]: an an_apple an_example

any

In this example, note that IPython displayed both of the two variables I defined, as well as the built-in function any. Also, you can also complete methods and attributes on any object after typing a period: 2.2 IPython Basics

|

23

In [3]: b = [1, 2, 3] In [4]: b. append() count() clear() extend() copy() index()

insert() pop() remove()

reverse() sort()

The same is true for modules: In [1]: import datetime In [2]: datetime. date MAXYEAR datetime MINYEAR datetime_CAPI time

timedelta timezone tzinfo

Note that IPython by default hides methods and attributes starting with underscores, such as magic methods and internal “private” methods and attributes, in order to avoid cluttering the display (and confusing novice users!). These, too, can be tab-completed, but you must first type an underscore to see them. If you prefer to always see such methods in tab completion, you can change this setting in the IPython configuration. See the IPython documenta‐ tion to find out how to do this.

Tab completion works in many contexts outside of searching the interactive name‐ space and completing object or module attributes. When typing anything that looks like a file path (even in a Python string), pressing the Tab key will complete anything on your computer’s filesystem matching what you’ve typed. Combined with the %run command (see “The %run Command” on page 512), this functionality can save you many keystrokes. Another area where tab completion saves time is in the completion of function keyword arguments (including the = sign!). See Figure 2-4.

Figure 2-4. Autocomplete function keywords in a Jupyter notebook We’ll have a closer look at functions in a little bit. 24

|

Chapter 2: Python Language Basics, IPython, and Jupyter Notebooks

Introspection Using a question mark (?) before or after a variable will display some general infor‐ mation about the object: In [1]: b = [1, 2, 3] In [2]: b? Type: list String form: [1, 2, 3] Length: 3 Docstring: Built-in mutable sequence. If no argument is given, the constructor creates a new empty list. The argument must be an iterable if specified. In [3]: print? Docstring: print(value, ..., sep=' ', end='\n', file=sys.stdout, flush=False) Prints the values to a stream, or to sys.stdout by default. Optional keyword arguments: file: a file-like object (stream); defaults to the current sys.stdout. sep: string inserted between values, default a space. end: string appended after the last value, default a newline. flush: whether to forcibly flush the stream. Type: builtin_function_or_method

This is referred to as object introspection. If the object is a function or instance method, the docstring, if defined, will also be shown. Suppose we’d written the following function (which you can reproduce in IPython or Jupyter): def add_numbers(a, b): """ Add two numbers together Returns ------the_sum : type of arguments """ return a + b

Then using ? shows us the docstring: In [6]: add_numbers? Signature: add_numbers(a, b) Docstring: Add two numbers together Returns ------the_sum : type of arguments

2.2 IPython Basics

|

25

File: Type:

function

? has a final usage, which is for searching the IPython namespace in a manner similar to the standard Unix or Windows command line. A number of characters combined with the wildcard (*) will show all names matching the wildcard expression. For example, we could get a list of all functions in the top-level NumPy namespace containing load: In [9]: import numpy as np In [10]: np.*load*? np.__loader__ np.load np.loads np.loadtxt

2.3 Python Language Basics In this section, I will give you an overview of essential Python programming concepts and language mechanics. In the next chapter, I will go into more detail about Python data structures, functions, and other built-in tools.

Language Semantics The Python language design is distinguished by its emphasis on readability, simplic‐ ity, and explicitness. Some people go so far as to liken it to “executable pseudocode.”

Indentation, not braces Python uses whitespace (tabs or spaces) to structure code instead of using braces as in many other languages like R, C++, Java, and Perl. Consider a for loop from a sorting algorithm: for x in array: if x < pivot: less.append(x) else: greater.append(x)

A colon denotes the start of an indented code block after which all of the code must be indented by the same amount until the end of the block. Love it or hate it, significant whitespace is a fact of life for Python programmers. While it may seem foreign at first, you will hopefully grow accustomed to it in time.

26

| Chapter 2: Python Language Basics, IPython, and Jupyter Notebooks

I strongly recommend using four spaces as your default indentation and replacing tabs with four spaces. Many text editors have a setting that will replace tab stops with spaces automatically (do this!). IPython and Jupyter notebooks will automatically insert four spaces on new lines following a colon and replace tabs by four spaces.

As you can see by now, Python statements also do not need to be terminated by semicolons. Semicolons can be used, however, to separate multiple statements on a single line: a = 5; b = 6; c = 7

Putting multiple statements on one line is generally discouraged in Python as it can make code less readable.

Everything is an object An important characteristic of the Python language is the consistency of its object model. Every number, string, data structure, function, class, module, and so on exists in the Python interpreter in its own “box,” which is referred to as a Python object. Each object has an associated type (e.g., integer, string, or function) and internal data. In practice this makes the language very flexible, as even functions can be treated like any other object.

Comments Any text preceded by the hash mark (pound sign) # is ignored by the Python interpreter. This is often used to add comments to code. At times you may also want to exclude certain blocks of code without deleting them. One solution is to comment out the code: results = [] for line in file_handle: # keep the empty lines for now # if len(line) == 0: # continue results.append(line.replace("foo", "bar"))

Comments can also occur after a line of executed code. While some programmers prefer comments to be placed in the line preceding a particular line of code, this can be useful at times: print("Reached this line")

# Simple status report

2.3 Python Language Basics

|

27

Function and object method calls You call functions using parentheses and passing zero or more arguments, optionally assigning the returned value to a variable: result = f(x, y, z) g()

Almost every object in Python has attached functions, known as methods, that have access to the object’s internal contents. You can call them using the following syntax: obj.some_method(x, y, z)

Functions can take both positional and keyword arguments: result = f(a, b, c, d=5, e="foo")

We will look at this in more detail later.

Variables and argument passing When assigning a variable (or name) in Python, you are creating a reference to the object shown on the righthand side of the equals sign. In practical terms, consider a list of integers: In [8]: a = [1, 2, 3]

Suppose we assign a to a new variable b: In [9]: b = a In [10]: b Out[10]: [1, 2, 3]

In some languages, the assignment if b will cause the data [1, 2, 3] to be copied. In Python, a and b actually now refer to the same object, the original list [1, 2, 3] (see Figure 2-5 for a mock-up). You can prove this to yourself by appending an element to a and then examining b: In [11]: a.append(4) In [12]: b Out[12]: [1, 2, 3, 4]

Figure 2-5. Two references for the same object Understanding the semantics of references in Python, and when, how, and why data is copied, is especially critical when you are working with larger datasets in Python. 28

| Chapter 2: Python Language Basics, IPython, and Jupyter Notebooks

Assignment is also referred to as binding, as we are binding a name to an object. Variable names that have been assigned may occasionally be referred to as bound variables.

When you pass objects as arguments to a function, new local variables are created referencing the original objects without any copying. If you bind a new object to a variable inside a function, that will not overwrite a variable of the same name in the “scope” outside of the function (the “parent scope”). It is therefore possible to alter the internals of a mutable argument. Suppose we had the following function: In [13]: def append_element(some_list, element): ....: some_list.append(element)

Then we have: In [14]: data = [1, 2, 3] In [15]: append_element(data, 4) In [16]: data Out[16]: [1, 2, 3, 4]

Dynamic references, strong types Variables in Python have no inherent type associated with them; a variable can refer to a different type of object simply by doing an assignment. There is no problem with the following: In [17]: a = 5 In [18]: type(a) Out[18]: int In [19]: a = "foo" In [20]: type(a) Out[20]: str

Variables are names for objects within a particular namespace; the type information is stored in the object itself. Some observers might hastily conclude that Python is not a “typed language.” This is not true; consider this example: In [21]: "5" + 5 --------------------------------------------------------------------------TypeError Traceback (most recent call last) in ----> 1 "5" + 5 TypeError: can only concatenate str (not "int") to str

2.3 Python Language Basics

|

29

In some languages, the string '5' might get implicitly converted (or cast) to an integer, thus yielding 10. In other languages the integer 5 might be cast to a string, yielding the concatenated string '55'. In Python, such implicit casts are not allowed. In this regard we say that Python is a strongly typed language, which means that every object has a specific type (or class), and implicit conversions will occur only in certain permitted circumstances, such as: In [22]: a = 4.5 In [23]: b = 2 # String formatting, to be visited later In [24]: print(f"a is {type(a)}, b is {type(b)}") a is , b is In [25]: a / b Out[25]: 2.25

Here, even though b is an integer, it is implicitly converted to a float for the division operation. Knowing the type of an object is important, and it’s useful to be able to write functions that can handle many different kinds of input. You can check that an object is an instance of a particular type using the isinstance function: In [26]: a = 5 In [27]: isinstance(a, int) Out[27]: True

isinstance can accept a tuple of types if you want to check that an object’s type is among those present in the tuple: In [28]: a = 5; b = 4.5 In [29]: isinstance(a, (int, float)) Out[29]: True In [30]: isinstance(b, (int, float)) Out[30]: True

Attributes and methods Objects in Python typically have both attributes (other Python objects stored “inside” the object) and methods (functions associated with an object that can have access to the object’s internal data). Both of them are accessed via the syntax obj.attribute_name: In [1]: a = "foo" In [2]: a.

30

|

Chapter 2: Python Language Basics, IPython, and Jupyter Notebooks

capitalize() casefold() center() count() encode() endswith() expandtabs() find() format() format_map()

index() isprintable() isalnum() isalpha() isascii() isdecimal() isdigit() isidentifier() islower() isnumeric()

isspace() istitle() isupper() join() ljust() lower() lstrip() maketrans() partition() removeprefix()

removesuffix() replace() rfind() rindex() rjust() rpartition() rsplit() rstrip() split() splitlines()

startswith() strip() swapcase() title() translate()

Attributes and methods can also be accessed by name via the getattr function: In [32]: getattr(a, "split") Out[32]:

While we will not extensively use the functions getattr and related functions hasattr and setattr in this book, they can be used very effectively to write generic, reusable code.

Duck typing Often you may not care about the type of an object but rather only whether it has certain methods or behavior. This is sometimes called duck typing, after the saying “If it walks like a duck and quacks like a duck, then it’s a duck.” For example, you can verify that an object is iterable if it implements the iterator protocol. For many objects, this means it has an __iter__ “magic method,” though an alternative and better way to check is to try using the iter function: In [33]: def isiterable(obj): ....: try: ....: iter(obj) ....: return True ....: except TypeError: # not iterable ....: return False

This function would return True for strings as well as most Python collection types: In [34]: isiterable("a string") Out[34]: True In [35]: isiterable([1, 2, 3]) Out[35]: True In [36]: isiterable(5) Out[36]: False

2.3 Python Language Basics

|

31

Imports In Python, a module is simply a file with the .py extension containing Python code. Suppose we had the following module: # some_module.py PI = 3.14159 def f(x): return x + 2 def g(a, b): return a + b

If we wanted to access the variables and functions defined in some_module.py, from another file in the same directory we could do: import some_module result = some_module.f(5) pi = some_module.PI

Or alternately: from some_module import g, PI result = g(5, PI)

By using the as keyword, you can give imports different variable names: import some_module as sm from some_module import PI as pi, g as gf r1 = sm.f(pi) r2 = gf(6, pi)

Binary operators and comparisons Most of the binary math operations and comparisons use familiar mathematical syntax used in other programming languages: In [37]: 5 - 7 Out[37]: -2 In [38]: 12 + 21.5 Out[38]: 33.5 In [39]: 5 = b True if a is greater than (greater than or equal to) b is b True if a and b reference the same Python object is not b True if a and b reference different Python objects

To check if two variables refer to the same object, use the is keyword. Use is not to check that two objects are not the same: In [40]: a = [1, 2, 3] In [41]: b = a In [42]: c = list(a) In [43]: a is b Out[43]: True In [44]: a is not c Out[44]: True

Since the list function always creates a new Python list (i.e., a copy), we can be sure that c is distinct from a. Comparing with is is not the same as the == operator, because in this case we have: In [45]: a == c Out[45]: True

2.3 Python Language Basics

|

33

A common use of is and is not is to check if a variable is None, since there is only one instance of None: In [46]: a = None In [47]: a is None Out[47]: True

Mutable and immutable objects Many objects in Python, such as lists, dictionaries, NumPy arrays, and most userdefined types (classes), are mutable. This means that the object or values that they contain can be modified: In [48]: a_list = ["foo", 2, [4, 5]] In [49]: a_list[2] = (3, 4) In [50]: a_list Out[50]: ['foo', 2, (3, 4)]

Others, like strings and tuples, are immutable, which means their internal data cannot be changed: In [51]: a_tuple = (3, 5, (4, 5)) In [52]: a_tuple[1] = "four" --------------------------------------------------------------------------TypeError Traceback (most recent call last) in ----> 1 a_tuple[1] = "four" TypeError: 'tuple' object does not support item assignment

Remember that just because you can mutate an object does not mean that you always should. Such actions are known as side effects. For example, when writing a function, any side effects should be explicitly communicated to the user in the function’s documentation or comments. If possible, I recommend trying to avoid side effects and favor immutability, even though there may be mutable objects involved.

Scalar Types Python has a small set of built-in types for handling numerical data, strings, Boolean (True or False) values, and dates and time. These “single value” types are sometimes called scalar types, and we refer to them in this book as scalars . See Table 2-2 for a list of the main scalar types. Date and time handling will be discussed separately, as these are provided by the datetime module in the standard library.

34

|

Chapter 2: Python Language Basics, IPython, and Jupyter Notebooks

Table 2-2. Standard Python scalar types Type

Description None The Python “null” value (only one instance of the None object exists) String type; holds Unicode strings str bytes Raw binary data float Double-precision floating-point number (note there is no separate double type) bool A Boolean True or False value Arbitrary precision integer int

Numeric types The primary Python types for numbers are int and float. An int can store arbitrar‐ ily large numbers: In [53]: ival = 17239871 In [54]: ival ** 6 Out[54]: 26254519291092456596965462913230729701102721

Floating-point numbers are represented with the Python float type. Under the hood, each one is a double-precision value. They can also be expressed with scientific notation: In [55]: fval = 7.243 In [56]: fval2 = 6.78e-5

Integer division not resulting in a whole number will always yield a floating-point number: In [57]: 3 / 2 Out[57]: 1.5

To get C-style integer division (which drops the fractional part if the result is not a whole number), use the floor division operator //: In [58]: 3 // 2 Out[58]: 1

Strings Many people use Python for its built-in string handling capabilities. You can write string literals using either single quotes ' or double quotes " (double quotes are generally favored): a = 'one way of writing a string' b = "another way"

The Python string type is str.

2.3 Python Language Basics

|

35

For multiline strings with line breaks, you can use triple quotes, either ''' or """: c = """ This is a longer string that spans multiple lines """

It may surprise you that this string c actually contains four lines of text; the line breaks after """ and after lines are included in the string. We can count the new line characters with the count method on c: In [60]: c.count("\n") Out[60]: 3

Python strings are immutable; you cannot modify a string: In [61]: a = "this is a string" In [62]: a[10] = "f" --------------------------------------------------------------------------TypeError Traceback (most recent call last) in ----> 1 a[10] = "f" TypeError: 'str' object does not support item assignment

To interpret this error message, read from the bottom up. We tried to replace the character (the “item”) at position 10 with the letter "f", but this is not allowed for string objects. If we need to modify a string, we have to use a function or method that creates a new string, such as the string replace method: In [63]: b = a.replace("string", "longer string") In [64]: b Out[64]: 'this is a longer string'

Afer this operation, the variable a is unmodified: In [65]: a Out[65]: 'this is a string'

Many Python objects can be converted to a string using the str function: In [66]: a = 5.6 In [67]: s = str(a) In [68]: print(s) 5.6

Strings are a sequence of Unicode characters and therefore can be treated like other sequences, such as lists and tuples: In [69]: s = "python"

36

|

Chapter 2: Python Language Basics, IPython, and Jupyter Notebooks

In [70]: list(s) Out[70]: ['p', 'y', 't', 'h', 'o', 'n'] In [71]: s[:3] Out[71]: 'pyt'

The syntax s[:3] is called slicing and is implemented for many kinds of Python sequences. This will be explained in more detail later on, as it is used extensively in this book. The backslash character \ is an escape character, meaning that it is used to specify special characters like newline \n or Unicode characters. To write a string literal with backslashes, you need to escape them: In [72]: s = "12\\34" In [73]: print(s) 12\34

If you have a string with a lot of backslashes and no special characters, you might find this a bit annoying. Fortunately you can preface the leading quote of the string with r, which means that the characters should be interpreted as is: In [74]: s = r"this\has\no\special\characters" In [75]: s Out[75]: 'this\\has\\no\\special\\characters'

The r stands for raw. Adding two strings together concatenates them and produces a new string: In [76]: a = "this is the first half " In [77]: b = "and this is the second half" In [78]: a + b Out[78]: 'this is the first half and this is the second half'

String templating or formatting is another important topic. The number of ways to do so has expanded with the advent of Python 3, and here I will briefly describe the mechanics of one of the main interfaces. String objects have a format method that can be used to substitute formatted arguments into the string, producing a new string: In [79]: template = "{0:.2f} {1:s} are worth US${2:d}"

In this string: • {0:.2f} means to format the first argument as a floating-point number with two decimal places.

2.3 Python Language Basics

|

37

• {1:s} means to format the second argument as a string. • {2:d} means to format the third argument as an exact integer. To substitute arguments for these format parameters, we pass a sequence of argu‐ ments to the format method: In [80]: template.format(88.46, "Argentine Pesos", 1) Out[80]: '88.46 Argentine Pesos are worth US$1'

Python 3.6 introduced a new feature called f-strings (short for formatted string literals) which can make creating formatted strings even more convenient. To create an fstring, write the character f immediately preceding a string literal. Within the string, enclose Python expressions in curly braces to substitute the value of the expression into the formatted string: In [81]: amount = 10 In [82]: rate = 88.46 In [83]: currency = "Pesos" In [84]: result = f"{amount} {currency} is worth US${amount / rate}"

Format specifiers can be added after each expression using the same syntax as with the string templates above: In [85]: f"{amount} {currency} is worth US${amount / rate:.2f}" Out[85]: '10 Pesos is worth US$0.11'

String formatting is a deep topic; there are multiple methods and numerous options and tweaks available to control how values are formatted in the resulting string. To learn more, consult the official Python documentation.

Bytes and Unicode In modern Python (i.e., Python 3.0 and up), Unicode has become the first-class string type to enable more consistent handling of ASCII and non-ASCII text. In older versions of Python, strings were all bytes without any explicit Unicode encoding. You could convert to Unicode assuming you knew the character encoding. Here is an example Unicode string with non-ASCII characters: In [86]: val = "español" In [87]: val Out[87]: 'español'

We can convert this Unicode string to its UTF-8 bytes representation using the encode method: In [88]: val_utf8 = val.encode("utf-8")

38

|

Chapter 2: Python Language Basics, IPython, and Jupyter Notebooks

In [89]: val_utf8 Out[89]: b'espa\xc3\xb1ol' In [90]: type(val_utf8) Out[90]: bytes

Assuming you know the Unicode encoding of a bytes object, you can go back using the decode method: In [91]: val_utf8.decode("utf-8") Out[91]: 'español'

While it is now preferable to use UTF-8 for any encoding, for historical reasons you may encounter data in any number of different encodings: In [92]: val.encode("latin1") Out[92]: b'espa\xf1ol' In [93]: val.encode("utf-16") Out[93]: b'\xff\xfee\x00s\x00p\x00a\x00\xf1\x00o\x00l\x00' In [94]: val.encode("utf-16le") Out[94]: b'e\x00s\x00p\x00a\x00\xf1\x00o\x00l\x00'

It is most common to encounter bytes objects in the context of working with files, where implicitly decoding all data to Unicode strings may not be desired.

Booleans The two Boolean values in Python are written as True and False. Comparisons and other conditional expressions evaluate to either True or False. Boolean values are combined with the and and or keywords: In [95]: True and True Out[95]: True In [96]: False or True Out[96]: True

When converted to numbers, False becomes 0 and True becomes 1: In [97]: int(False) Out[97]: 0 In [98]: int(True) Out[98]: 1

The keyword not flips a Boolean value from True to False or vice versa: In [99]: a = True In [100]: b = False

2.3 Python Language Basics

|

39

In [101]: not a Out[101]: False In [102]: not b Out[102]: True

Type casting The str, bool, int, and float types are also functions that can be used to cast values to those types: In [103]: s = "3.14159" In [104]: fval = float(s) In [105]: type(fval) Out[105]: float In [106]: int(fval) Out[106]: 3 In [107]: bool(fval) Out[107]: True In [108]: bool(0) Out[108]: False

Note that most nonzero values when cast to bool become True.

None None is the Python null value type: In [109]: a = None In [110]: a is None Out[110]: True In [111]: b = 5 In [112]: b is not None Out[112]: True

None is also a common default value for function arguments: def add_and_maybe_multiply(a, b, c=None): result = a + b if c is not None: result = result * c return result

40

| Chapter 2: Python Language Basics, IPython, and Jupyter Notebooks

Dates and times The built-in Python datetime module provides datetime, date, and time types. The datetime type combines the information stored in date and time and is the most commonly used: In [113]: from datetime import datetime, date, time In [114]: dt = datetime(2011, 10, 29, 20, 30, 21) In [115]: dt.day Out[115]: 29 In [116]: dt.minute Out[116]: 30

Given a datetime instance, you can extract the equivalent date and time objects by calling methods on the datetime of the same name: In [117]: dt.date() Out[117]: datetime.date(2011, 10, 29) In [118]: dt.time() Out[118]: datetime.time(20, 30, 21)

The strftime method formats a datetime as a string: In [119]: dt.strftime("%Y-%m-%d %H:%M") Out[119]: '2011-10-29 20:30'

Strings can be converted (parsed) into datetime objects with the strptime function: In [120]: datetime.strptime("20091031", "%Y%m%d") Out[120]: datetime.datetime(2009, 10, 31, 0, 0)

See Table 11-2 for a full list of format specifications. When you are aggregating or otherwise grouping time series data, it will occasionally be useful to replace time fields of a series of datetimes—for example, replacing the minute and second fields with zero: In [121]: dt_hour = dt.replace(minute=0, second=0) In [122]: dt_hour Out[122]: datetime.datetime(2011, 10, 29, 20, 0)

Since datetime.datetime is an immutable type, methods like these always produce new objects. So in the previous example, dt is not modified by replace: In [123]: dt Out[123]: datetime.datetime(2011, 10, 29, 20, 30, 21)

The difference of two datetime objects produces a datetime.timedelta type:

2.3 Python Language Basics

|

41

In [124]: dt2 = datetime(2011, 11, 15, 22, 30) In [125]: delta = dt2 - dt In [126]: delta Out[126]: datetime.timedelta(days=17, seconds=7179) In [127]: type(delta) Out[127]: datetime.timedelta

The output timedelta(17, 7179) indicates that the timedelta encodes an offset of 17 days and 7,179 seconds. Adding a timedelta to a datetime produces a new shifted datetime: In [128]: dt Out[128]: datetime.datetime(2011, 10, 29, 20, 30, 21) In [129]: dt + delta Out[129]: datetime.datetime(2011, 11, 15, 22, 30)

Control Flow Python has several built-in keywords for conditional logic, loops, and other standard control flow concepts found in other programming languages.

if, elif, and else The if statement is one of the most well-known control flow statement types. It checks a condition that, if True, evaluates the code in the block that follows: x = -5 if x < 0: print("It's negative")

An if statement can be optionally followed by one or more elif blocks and a catchall else block if all of the conditions are False: if x < 0: print("It's negative") elif x == 0: print("Equal to zero") elif 0 < x < 5: print("Positive but smaller than 5") else: print("Positive and larger than or equal to 5")

If any of the conditions are True, no further elif or else blocks will be reached. With a compound condition using and or or, conditions are evaluated left to right and will short-circuit:

42

| Chapter 2: Python Language Basics, IPython, and Jupyter Notebooks

In [130]: a = 5; b = 7 In [131]: c = 8; d = 4 In [132]: if a < b or c > d: .....: print("Made it") Made it

In this example, the comparison c > d never gets evaluated because the first compar‐ ison was True. It is also possible to chain comparisons: In [133]: 4 > 3 > 2 > 1 Out[133]: True

for loops for loops are for iterating over a collection (like a list or tuple) or an iterater. The standard syntax for a for loop is: for value in collection: # do something with value

You can advance a for loop to the next iteration, skipping the remainder of the block, using the continue keyword. Consider this code, which sums up integers in a list and skips None values: sequence = [1, 2, None, 4, None, 5] total = 0 for value in sequence: if value is None: continue total += value

A for loop can be exited altogether with the break keyword. This code sums ele‐ ments of the list until a 5 is reached: sequence = [1, 2, 0, 4, 6, 5, 2, 1] total_until_5 = 0 for value in sequence: if value == 5: break total_until_5 += value

The break keyword only terminates the innermost for loop; any outer for loops will continue to run: In [134]: for i in range(4): .....: for j in range(4): .....: if j > i: .....: break .....: print((i, j))

2.3 Python Language Basics

|

43

.....: 0) 0) 1) 0) 1) 2) 0) 1) 2) 3)

(0, (1, (1, (2, (2, (2, (3, (3, (3, (3,

As we will see in more detail, if the elements in the collection or iterator are sequen‐ ces (tuples or lists, say), they can be conveniently unpacked into variables in the for loop statement: for a, b, c in iterator: # do something

while loops A while loop specifies a condition and a block of code that is to be executed until the condition evaluates to False or the loop is explicitly ended with break: x = 256 total = 0 while x > 0: if total > 500: break total += x x = x // 2

pass pass is the “no-op” (or “do nothing”) statement in Python. It can be used in blocks where no action is to be taken (or as a placeholder for code not yet implemented); it is required only because Python uses whitespace to delimit blocks: if x < 0: print("negative!") elif x == 0: # TODO: put something smart here pass else: print("positive!")

range The range function generates a sequence of evenly spaced integers: In [135]: range(10) Out[135]: range(0, 10)

44

|

Chapter 2: Python Language Basics, IPython, and Jupyter Notebooks

In [136]: list(range(10)) Out[136]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

A start, end, and step (which may be negative) can be given: In [137]: list(range(0, 20, 2)) Out[137]: [0, 2, 4, 6, 8, 10, 12, 14, 16, 18] In [138]: list(range(5, 0, -1)) Out[138]: [5, 4, 3, 2, 1]

As you can see, range produces integers up to but not including the endpoint. A common use of range is for iterating through sequences by index: In [139]: seq = [1, 2, 3, 4] In [140]: for i in range(len(seq)): .....: print(f"element {i}: {seq[i]}") element 0: 1 element 1: 2 element 2: 3 element 3: 4

While you can use functions like list to store all the integers generated by range in some other data structure, often the default iterator form will be what you want. This snippet sums all numbers from 0 to 99,999 that are multiples of 3 or 5: In [141]: total = 0 In [142]: for i in range(100_000): .....: # % is the modulo operator .....: if i % 3 == 0 or i % 5 == 0: .....: total += i In [143]: print(total) 2333316668

While the range generated can be arbitrarily large, the memory use at any given time may be very small.

2.4 Conclusion This chapter provided a brief introduction to some basic Python language concepts and the IPython and Jupyter programming environments. In the next chapter, I will discuss many built-in data types, functions, and input-output utilities that will be used continuously throughout the rest of the book.

2.4 Conclusion

|

45

CHAPTER 3

Built-In Data Structures, Functions, and Files

This chapter discusses capabilities built into the Python language that will be used ubiquitously throughout the book. While add-on libraries like pandas and NumPy add advanced computational functionality for larger datasets, they are designed to be used together with Python’s built-in data manipulation tools. We’ll start with Python’s workhorse data structures: tuples, lists, dictionaries, and sets. Then, we’ll discuss creating your own reusable Python functions. Finally, we’ll look at the mechanics of Python file objects and interacting with your local hard drive.

3.1 Data Structures and Sequences Python’s data structures are simple but powerful. Mastering their use is a critical part of becoming a proficient Python programmer. We start with tuple, list, and dictionary, which are some of the most frequently used sequence types.

Tuple A tuple is a fixed-length, immutable sequence of Python objects which, once assigned, cannot be changed. The easiest way to create one is with a comma-separated sequence of values wrapped in parentheses: In [2]: tup = (4, 5, 6) In [3]: tup Out[3]: (4, 5, 6)

47

In many contexts, the parentheses can be omitted, so here we could also have written: In [4]: tup = 4, 5, 6 In [5]: tup Out[5]: (4, 5, 6)

You can convert any sequence or iterator to a tuple by invoking tuple: In [6]: tuple([4, 0, 2]) Out[6]: (4, 0, 2) In [7]: tup = tuple('string') In [8]: tup Out[8]: ('s', 't', 'r', 'i', 'n', 'g')

Elements can be accessed with square brackets [] as with most other sequence types. As in C, C++, Java, and many other languages, sequences are 0-indexed in Python: In [9]: tup[0] Out[9]: 's'

When you’re defining tuples within more complicated expressions, it’s often neces‐ sary to enclose the values in parentheses, as in this example of creating a tuple of tuples: In [10]: nested_tup = (4, 5, 6), (7, 8) In [11]: nested_tup Out[11]: ((4, 5, 6), (7, 8)) In [12]: nested_tup[0] Out[12]: (4, 5, 6) In [13]: nested_tup[1] Out[13]: (7, 8)

While the objects stored in a tuple may be mutable themselves, once the tuple is created it’s not possible to modify which object is stored in each slot: In [14]: tup = tuple(['foo', [1, 2], True]) In [15]: tup[2] = False --------------------------------------------------------------------------TypeError Traceback (most recent call last) in ----> 1 tup[2] = False TypeError: 'tuple' object does not support item assignment

If an object inside a tuple is mutable, such as a list, you can modify it in place: In [16]: tup[1].append(3)

48

|

Chapter 3: Built-In Data Structures, Functions, and Files

In [17]: tup Out[17]: ('foo', [1, 2, 3], True)

You can concatenate tuples using the + operator to produce longer tuples: In [18]: (4, None, 'foo') + (6, 0) + ('bar',) Out[18]: (4, None, 'foo', 6, 0, 'bar')

Multiplying a tuple by an integer, as with lists, has the effect of concatenating that many copies of the tuple: In [19]: ('foo', 'bar') * 4 Out[19]: ('foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'bar')

Note that the objects themselves are not copied, only the references to them.

Unpacking tuples If you try to assign to a tuple-like expression of variables, Python will attempt to unpack the value on the righthand side of the equals sign: In [20]: tup = (4, 5, 6) In [21]: a, b, c = tup In [22]: b Out[22]: 5

Even sequences with nested tuples can be unpacked: In [23]: tup = 4, 5, (6, 7) In [24]: a, b, (c, d) = tup In [25]: d Out[25]: 7

Using this functionality you can easily swap variable names, a task that in many languages might look like: tmp = a a = b b = tmp

But, in Python, the swap can be done like this: In [26]: a, b = 1, 2 In [27]: a Out[27]: 1 In [28]: b Out[28]: 2 In [29]: b, a = a, b

3.1 Data Structures and Sequences

|

49

In [30]: a Out[30]: 2 In [31]: b Out[31]: 1

A common use of variable unpacking is iterating over sequences of tuples or lists: In [32]: seq = [(1, 2, 3), (4, 5, 6), (7, 8, 9)] In [33]: for a, b, c in seq: ....: print(f'a={a}, b={b}, c={c}') a=1, b=2, c=3 a=4, b=5, c=6 a=7, b=8, c=9

Another common use is returning multiple values from a function. I’ll cover this in more detail later. There are some situations where you may want to “pluck” a few elements from the beginning of a tuple. There is a special syntax that can do this, *rest, which is also used in function signatures to capture an arbitrarily long list of positional arguments: In [34]: values = 1, 2, 3, 4, 5 In [35]: a, b, *rest = values In [36]: a Out[36]: 1 In [37]: b Out[37]: 2 In [38]: rest Out[38]: [3, 4, 5]

This rest bit is sometimes something you want to discard; there is nothing special about the rest name. As a matter of convention, many Python programmers will use the underscore (_) for unwanted variables: In [39]: a, b, *_ = values

Tuple methods Since the size and contents of a tuple cannot be modified, it is very light on instance methods. A particularly useful one (also available on lists) is count, which counts the number of occurrences of a value: In [40]: a = (1, 2, 2, 2, 3, 4, 2) In [41]: a.count(2) Out[41]: 4

50

|

Chapter 3: Built-In Data Structures, Functions, and Files

List In contrast with tuples, lists are variable length and their contents can be modified in place. Lists are mutable. You can define them using square brackets [] or using the list type function: In [42]: a_list = [2, 3, 7, None] In [43]: tup = ("foo", "bar", "baz") In [44]: b_list = list(tup) In [45]: b_list Out[45]: ['foo', 'bar', 'baz'] In [46]: b_list[1] = "peekaboo" In [47]: b_list Out[47]: ['foo', 'peekaboo', 'baz']

Lists and tuples are semantically similar (though tuples cannot be modified) and can be used interchangeably in many functions. The list built-in function is frequently used in data processing as a way to material‐ ize an iterator or generator expression: In [48]: gen = range(10) In [49]: gen Out[49]: range(0, 10) In [50]: list(gen) Out[50]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

Adding and removing elements Elements can be appended to the end of the list with the append method: In [51]: b_list.append("dwarf") In [52]: b_list Out[52]: ['foo', 'peekaboo', 'baz', 'dwarf']

Using insert you can insert an element at a specific location in the list: In [53]: b_list.insert(1, "red") In [54]: b_list Out[54]: ['foo', 'red', 'peekaboo', 'baz', 'dwarf']

The insertion index must be between 0 and the length of the list, inclusive.

3.1 Data Structures and Sequences

|

51

insert is computationally expensive compared with append,

because references to subsequent elements have to be shifted inter‐ nally to make room for the new element. If you need to insert elements at both the beginning and end of a sequence, you may wish to explore collections.deque, a double-ended queue, which is optimized for this purpose and found in the Python Standard Library.

The inverse operation to insert is pop, which removes and returns an element at a particular index: In [55]: b_list.pop(2) Out[55]: 'peekaboo' In [56]: b_list Out[56]: ['foo', 'red', 'baz', 'dwarf']

Elements can be removed by value with remove, which locates the first such value and removes it from the list: In [57]: b_list.append("foo") In [58]: b_list Out[58]: ['foo', 'red', 'baz', 'dwarf', 'foo'] In [59]: b_list.remove("foo") In [60]: b_list Out[60]: ['red', 'baz', 'dwarf', 'foo']

If performance is not a concern, by using append and remove, you can use a Python list as a set-like data structure (although Python has actual set objects, discussed later). Check if a list contains a value using the in keyword: In [61]: "dwarf" in b_list Out[61]: True

The keyword not can be used to negate in: In [62]: "dwarf" not in b_list Out[62]: False

Checking whether a list contains a value is a lot slower than doing so with diction‐ aries and sets (to be introduced shortly), as Python makes a linear scan across the values of the list, whereas it can check the others (based on hash tables) in constant time.

52

|

Chapter 3: Built-In Data Structures, Functions, and Files

Concatenating and combining lists Similar to tuples, adding two lists together with + concatenates them: In [63]: [4, None, "foo"] + [7, 8, (2, 3)] Out[63]: [4, None, 'foo', 7, 8, (2, 3)]

If you have a list already defined, you can append multiple elements to it using the extend method: In [64]: x = [4, None, "foo"] In [65]: x.extend([7, 8, (2, 3)]) In [66]: x Out[66]: [4, None, 'foo', 7, 8, (2, 3)]

Note that list concatenation by addition is a comparatively expensive operation since a new list must be created and the objects copied over. Using extend to append elements to an existing list, especially if you are building up a large list, is usually preferable. Thus: everything = [] for chunk in list_of_lists: everything.extend(chunk)

is faster than the concatenative alternative: everything = [] for chunk in list_of_lists: everything = everything + chunk

Sorting You can sort a list in place (without creating a new object) by calling its sort function: In [67]: a = [7, 2, 5, 1, 3] In [68]: a.sort() In [69]: a Out[69]: [1, 2, 3, 5, 7]

sort has a few options that will occasionally come in handy. One is the ability to pass a secondary sort key—that is, a function that produces a value to use to sort the objects. For example, we could sort a collection of strings by their lengths: In [70]: b = ["saw", "small", "He", "foxes", "six"] In [71]: b.sort(key=len) In [72]: b Out[72]: ['He', 'saw', 'six', 'small', 'foxes']

3.1 Data Structures and Sequences

|

53

Soon, we’ll look at the sorted function, which can produce a sorted copy of a general sequence.

Slicing You can select sections of most sequence types by using slice notation, which in its basic form consists of start:stop passed to the indexing operator []: In [73]: seq = [7, 2, 3, 7, 5, 6, 0, 1] In [74]: seq[1:5] Out[74]: [2, 3, 7, 5]

Slices can also be assigned with a sequence: In [75]: seq[3:5] = [6, 3] In [76]: seq Out[76]: [7, 2, 3, 6, 3, 6, 0, 1]

While the element at the start index is included, the stop index is not included, so that the number of elements in the result is stop - start. Either the start or stop can be omitted, in which case they default to the start of the sequence and the end of the sequence, respectively: In [77]: seq[:5] Out[77]: [7, 2, 3, 6, 3] In [78]: seq[3:] Out[78]: [6, 3, 6, 0, 1]

Negative indices slice the sequence relative to the end: In [79]: seq[-4:] Out[79]: [3, 6, 0, 1] In [80]: seq[-6:-2] Out[80]: [3, 6, 3, 6]

Slicing semantics takes a bit of getting used to, especially if you’re coming from R or MATLAB. See Figure 3-1 for a helpful illustration of slicing with positive and negative integers. In the figure, the indices are shown at the “bin edges” to help show where the slice selections start and stop using positive or negative indices.

54

| Chapter 3: Built-In Data Structures, Functions, and Files

Figure 3-1. Illustration of Python slicing conventions A step can also be used after a second colon to, say, take every other element: In [81]: seq[::2] Out[81]: [7, 3, 3, 0]

A clever use of this is to pass -1, which has the useful effect of reversing a list or tuple: In [82]: seq[::-1] Out[82]: [1, 0, 6, 3, 6, 3, 2, 7]

Dictionary The dictionary or dict may be the most important built-in Python data structure. In other programming languages, dictionaries are sometimes called hash maps or associative arrays. A dictionary stores a collection of key-value pairs, where key and value are Python objects. Each key is associated with a value so that a value can be conveniently retrieved, inserted, modified, or deleted given a particular key. One approach for creating a dictionary is to use curly braces {} and colons to separate keys and values: In [83]: empty_dict = {} In [84]: d1 = {"a": "some value", "b": [1, 2, 3, 4]} In [85]: d1 Out[85]: {'a': 'some value', 'b': [1, 2, 3, 4]}

You can access, insert, or set elements using the same syntax as for accessing elements of a list or tuple: In [86]: d1[7] = "an integer" In [87]: d1 Out[87]: {'a': 'some value', 'b': [1, 2, 3, 4], 7: 'an integer'} In [88]: d1["b"] Out[88]: [1, 2, 3, 4]

3.1 Data Structures and Sequences

|

55

You can check if a dictionary contains a key using the same syntax used for checking whether a list or tuple contains a value: In [89]: "b" in d1 Out[89]: True

You can delete values using either the del keyword or the pop method (which simultaneously returns the value and deletes the key): In [90]: d1[5] = "some value" In [91]: d1 Out[91]: {'a': 'some value', 'b': [1, 2, 3, 4], 7: 'an integer', 5: 'some value'} In [92]: d1["dummy"] = "another value" In [93]: d1 Out[93]: {'a': 'some value', 'b': [1, 2, 3, 4], 7: 'an integer', 5: 'some value', 'dummy': 'another value'} In [94]: del d1[5] In [95]: d1 Out[95]: {'a': 'some value', 'b': [1, 2, 3, 4], 7: 'an integer', 'dummy': 'another value'} In [96]: ret = d1.pop("dummy") In [97]: ret Out[97]: 'another value' In [98]: d1 Out[98]: {'a': 'some value', 'b': [1, 2, 3, 4], 7: 'an integer'}

The keys and values method gives you iterators of the dictionary’s keys and values, respectively. The order of the keys depends on the order of their insertion, and these functions output the keys and values in the same respective order: In [99]: list(d1.keys()) Out[99]: ['a', 'b', 7]

56

|

Chapter 3: Built-In Data Structures, Functions, and Files

In [100]: list(d1.values()) Out[100]: ['some value', [1, 2, 3, 4], 'an integer']

If you need to iterate over both the keys and values, you can use the items method to iterate over the keys and values as 2-tuples: In [101]: list(d1.items()) Out[101]: [('a', 'some value'), ('b', [1, 2, 3, 4]), (7, 'an integer')]

You can merge one dictionary into another using the update method: In [102]: d1.update({"b": "foo", "c": 12}) In [103]: d1 Out[103]: {'a': 'some value', 'b': 'foo', 7: 'an integer', 'c': 12}

The update method changes dictionaries in place, so any existing keys in the data passed to update will have their old values discarded.

Creating dictionaries from sequences It’s common to occasionally end up with two sequences that you want to pair up element-wise in a dictionary. As a first cut, you might write code like this: mapping = {} for key, value in zip(key_list, value_list): mapping[key] = value

Since a dictionary is essentially a collection of 2-tuples, the dict function accepts a list of 2-tuples: In [104]: tuples = zip(range(5), reversed(range(5))) In [105]: tuples Out[105]: In [106]: mapping = dict(tuples) In [107]: mapping Out[107]: {0: 4, 1: 3, 2: 2, 3: 1, 4: 0}

Later we’ll talk about dictionary comprehensions, which are another way to construct dictionaries.

Default values It’s common to have logic like: if key in some_dict: value = some_dict[key] else: value = default_value

3.1 Data Structures and Sequences

|

57

Thus, the dictionary methods get and pop can take a default value to be returned, so that the above if-else block can be written simply as: value = some_dict.get(key, default_value)

get by default will return None if the key is not present, while pop will raise an

exception. With setting values, it may be that the values in a dictionary are another kind of collection, like a list. For example, you could imagine categorizing a list of words by their first letters as a dictionary of lists: In [108]: words = ["apple", "bat", "bar", "atom", "book"] In [109]: by_letter = {} In [110]: for word in words: .....: letter = word[0] .....: if letter not in by_letter: .....: by_letter[letter] = [word] .....: else: .....: by_letter[letter].append(word) .....: In [111]: by_letter Out[111]: {'a': ['apple', 'atom'], 'b': ['bat', 'bar', 'book']}

The setdefault dictionary method can be used to simplify this workflow. The preceding for loop can be rewritten as: In [112]: by_letter = {} In [113]: for word in words: .....: letter = word[0] .....: by_letter.setdefault(letter, []).append(word) .....: In [114]: by_letter Out[114]: {'a': ['apple', 'atom'], 'b': ['bat', 'bar', 'book']}

The built-in collections module has a useful class, defaultdict, which makes this even easier. To create one, you pass a type or function for generating the default value for each slot in the dictionary: In [115]: from collections import defaultdict In [116]: by_letter = defaultdict(list) In [117]: for word in words: .....: by_letter[word[0]].append(word)

58

|

Chapter 3: Built-In Data Structures, Functions, and Files

Valid dictionary key types While the values of a dictionary can be any Python object, the keys generally have to be immutable objects like scalar types (int, float, string) or tuples (all the objects in the tuple need to be immutable, too). The technical term here is hashability. You can check whether an object is hashable (can be used as a key in a dictionary) with the hash function: In [118]: hash("string") Out[118]: 3634226001988967898 In [119]: hash((1, 2, (2, 3))) Out[119]: -9209053662355515447 In [120]: hash((1, 2, [2, 3])) # fails because lists are mutable --------------------------------------------------------------------------TypeError Traceback (most recent call last) in ----> 1 hash((1, 2, [2, 3])) # fails because lists are mutable TypeError: unhashable type: 'list'

The hash values you see when using the hash function in general will depend on the Python version you are using. To use a list as a key, one option is to convert it to a tuple, which can be hashed as long as its elements also can be: In [121]: d = {} In [122]: d[tuple([1, 2, 3])] = 5 In [123]: d Out[123]: {(1, 2, 3): 5}

Set A set is an unordered collection of unique elements. A set can be created in two ways: via the set function or via a set literal with curly braces: In [124]: set([2, 2, 2, 1, 3, 3]) Out[124]: {1, 2, 3} In [125]: {2, 2, 2, 1, 3, 3} Out[125]: {1, 2, 3}

Sets support mathematical set operations like union, intersection, difference, and symmetric difference. Consider these two example sets: In [126]: a = {1, 2, 3, 4, 5} In [127]: b = {3, 4, 5, 6, 7, 8}

3.1 Data Structures and Sequences

|

59

The union of these two sets is the set of distinct elements occurring in either set. This can be computed with either the union method or the | binary operator: In [128]: a.union(b) Out[128]: {1, 2, 3, 4, 5, 6, 7, 8} In [129]: a | b Out[129]: {1, 2, 3, 4, 5, 6, 7, 8}

The intersection contains the elements occurring in both sets. The & operator or the

intersection method can be used: In [130]: a.intersection(b) Out[130]: {3, 4, 5} In [131]: a & b Out[131]: {3, 4, 5}

See Table 3-1 for a list of commonly used set methods. Table 3-1. Python set operations Function a.add(x) a.clear() a.remove(x) a.pop()

Alternative syntax N/A N/A N/A N/A

a.union(b)

a | b

a.update(b)

a |= b

a.intersection(b)

a & b

a.intersection_update(b)

a &= b

a.difference(b)

a - b

a.difference_update(b)

a -= b

a.symmetric_difference(b)

a ^ b

a.symmetric_difference_update(b) a ^= b a.issubset(b)

=

a.isdisjoint(b)

N/A

60

|

Chapter 3: Built-In Data Structures, Functions, and Files

Description Add element x to set a Reset set a to an empty state, discarding all of its elements Remove element x from set a Remove an arbitrary element from set a, raising KeyError if the set is empty All of the unique elements in a and b Set the contents of a to be the union of the elements in a and b All of the elements in both a and b Set the contents of a to be the intersection of the elements in a and b The elements in a that are not in b Set a to the elements in a that are not in b All of the elements in either a or b but not both Set a to contain the elements in either a or b but not both True if the elements of a are all contained in b True if the elements of b are all contained in a True if a and b have no elements in common

If you pass an input that is not a set to methods like union and intersection, Python will convert the input to a set before execut‐ ing the operation. When using the binary operators, both objects must already be sets.

All of the logical set operations have in-place counterparts, which enable you to replace the contents of the set on the left side of the operation with the result. For very large sets, this may be more efficient: In [132]: c = a.copy() In [133]: c |= b In [134]: c Out[134]: {1, 2, 3, 4, 5, 6, 7, 8} In [135]: d = a.copy() In [136]: d &= b In [137]: d Out[137]: {3, 4, 5}

Like dictionary keys, set elements generally must be immutable, and they must be hashable (which means that calling hash on a value does not raise an exception). In order to store list-like elements (or other mutable sequences) in a set, you can convert them to tuples: In [138]: my_data = [1, 2, 3, 4] In [139]: my_set = {tuple(my_data)} In [140]: my_set Out[140]: {(1, 2, 3, 4)}

You can also check if a set is a subset of (is contained in) or a superset of (contains all elements of) another set: In [141]: a_set = {1, 2, 3, 4, 5} In [142]: {1, 2, 3}.issubset(a_set) Out[142]: True In [143]: a_set.issuperset({1, 2, 3}) Out[143]: True

Sets are equal if and only if their contents are equal: In [144]: {1, 2, 3} == {3, 2, 1} Out[144]: True

3.1 Data Structures and Sequences

|

61

Built-In Sequence Functions Python has a handful of useful sequence functions that you should familiarize your‐ self with and use at any opportunity.

enumerate It’s common when iterating over a sequence to want to keep track of the index of the current item. A do-it-yourself approach would look like: index = 0 for value in collection: # do something with value index += 1

Since this is so common, Python has a built-in function, enumerate, which returns a sequence of (i, value) tuples: for index, value in enumerate(collection): # do something with value

sorted The sorted function returns a new sorted list from the elements of any sequence: In [145]: sorted([7, 1, 2, 6, 0, 3, 2]) Out[145]: [0, 1, 2, 2, 3, 6, 7] In [146]: sorted("horse race") Out[146]: [' ', 'a', 'c', 'e', 'e', 'h', 'o', 'r', 'r', 's']

The sorted function accepts the same arguments as the sort method on lists.

zip zip “pairs” up the elements of a number of lists, tuples, or other sequences to create a

list of tuples:

In [147]: seq1 = ["foo", "bar", "baz"] In [148]: seq2 = ["one", "two", "three"] In [149]: zipped = zip(seq1, seq2) In [150]: list(zipped) Out[150]: [('foo', 'one'), ('bar', 'two'), ('baz', 'three')]

zip can take an arbitrary number of sequences, and the number of elements it produces is determined by the shortest sequence: In [151]: seq3 = [False, True]

62

|

Chapter 3: Built-In Data Structures, Functions, and Files

In [152]: list(zip(seq1, seq2, seq3)) Out[152]: [('foo', 'one', False), ('bar', 'two', True)]

A common use of zip is simultaneously iterating over multiple sequences, possibly also combined with enumerate: In [153]: for index, (a, b) in enumerate(zip(seq1, seq2)): .....: print(f"{index}: {a}, {b}") .....: 0: foo, one 1: bar, two 2: baz, three

reversed reversed iterates over the elements of a sequence in reverse order: In [154]: list(reversed(range(10))) Out[154]: [9, 8, 7, 6, 5, 4, 3, 2, 1, 0]

Keep in mind that reversed is a generator (to be discussed in some more detail later), so it does not create the reversed sequence until materialized (e.g., with list or a for loop).

List, Set, and Dictionary Comprehensions List comprehensions are a convenient and widely used Python language feature. They allow you to concisely form a new list by filtering the elements of a collection, transforming the elements passing the filter into one concise expression. They take the basic form: [expr for value in collection if condition]

This is equivalent to the following for loop: result = [] for value in collection: if condition: result.append(expr)

The filter condition can be omitted, leaving only the expression. For example, given a list of strings, we could filter out strings with length 2 or less and convert them to uppercase like this: In [155]: strings = ["a", "as", "bat", "car", "dove", "python"] In [156]: [x.upper() for x in strings if len(x) > 2] Out[156]: ['BAT', 'CAR', 'DOVE', 'PYTHON']

Set and dictionary comprehensions are a natural extension, producing sets and dic‐ tionaries in an idiomatically similar way instead of lists.

3.1 Data Structures and Sequences

|

63

A dictionary comprehension looks like this: dict_comp = {key-expr: value-expr for value in collection if condition}

A set comprehension looks like the equivalent list comprehension except with curly braces instead of square brackets: set_comp = {expr for value in collection if condition}

Like list comprehensions, set and dictionary comprehensions are mostly convenien‐ ces, but they similarly can make code both easier to write and read. Consider the list of strings from before. Suppose we wanted a set containing just the lengths of the strings contained in the collection; we could easily compute this using a set comprehension: In [157]: unique_lengths = {len(x) for x in strings} In [158]: unique_lengths Out[158]: {1, 2, 3, 4, 6}

We could also express this more functionally using the map function, introduced shortly: In [159]: set(map(len, strings)) Out[159]: {1, 2, 3, 4, 6}

As a simple dictionary comprehension example, we could create a lookup map of these strings for their locations in the list: In [160]: loc_mapping = {value: index for index, value in enumerate(strings)} In [161]: loc_mapping Out[161]: {'a': 0, 'as': 1, 'bat': 2, 'car': 3, 'dove': 4, 'python': 5}

Nested list comprehensions Suppose we have a list of lists containing some English and Spanish names: In [162]: all_data = [["John", "Emily", "Michael", "Mary", "Steven"], .....: ["Maria", "Juan", "Javier", "Natalia", "Pilar"]]

Suppose we wanted to get a single list containing all names with two or more a’s in them. We could certainly do this with a simple for loop: In [163]: names_of_interest = [] In [164]: for names in all_data: .....: enough_as = [name for name in names if name.count("a") >= 2] .....: names_of_interest.extend(enough_as) .....: In [165]: names_of_interest Out[165]: ['Maria', 'Natalia']

64

|

Chapter 3: Built-In Data Structures, Functions, and Files

You can actually wrap this whole operation up in a single nested list comprehension, which will look like: In [166]: result = [name for names in all_data for name in names .....: if name.count("a") >= 2] In [167]: result Out[167]: ['Maria', 'Natalia']

At first, nested list comprehensions are a bit hard to wrap your head around. The for parts of the list comprehension are arranged according to the order of nesting, and any filter condition is put at the end as before. Here is another example where we “flatten” a list of tuples of integers into a simple list of integers: In [168]: some_tuples = [(1, 2, 3), (4, 5, 6), (7, 8, 9)] In [169]: flattened = [x for tup in some_tuples for x in tup] In [170]: flattened Out[170]: [1, 2, 3, 4, 5, 6, 7, 8, 9]

Keep in mind that the order of the for expressions would be the same if you wrote a nested for loop instead of a list comprehension: flattened = [] for tup in some_tuples: for x in tup: flattened.append(x)

You can have arbitrarily many levels of nesting, though if you have more than two or three levels of nesting, you should probably start to question whether this makes sense from a code readability standpoint. It’s important to distinguish the syntax just shown from a list comprehension inside a list comprehension, which is also perfectly valid: In [172]: [[x for x in tup] for tup in some_tuples] Out[172]: [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

This produces a list of lists, rather than a flattened list of all of the inner elements.

3.2 Functions Functions are the primary and most important method of code organization and reuse in Python. As a rule of thumb, if you anticipate needing to repeat the same or very similar code more than once, it may be worth writing a reusable function. Functions can also help make your code more readable by giving a name to a group of Python statements.

3.2 Functions

|

65

Functions are declared with the def keyword. A function contains a block of code with an optional use of the return keyword: In [173]: def my_function(x, y): .....: return x + y

When a line with return is reached, the value or expression after return is sent to the context where the function was called, for example: In [174]: my_function(1, 2) Out[174]: 3 In [175]: result = my_function(1, 2) In [176]: result Out[176]: 3

There is no issue with having multiple return statements. If Python reaches the end of a function without encountering a return statement, None is returned automati‐ cally. For example: In [177]: def function_without_return(x): .....: print(x) In [178]: result = function_without_return("hello!") hello! In [179]: print(result) None

Each function can have positional arguments and keyword arguments. Keyword argu‐ ments are most commonly used to specify default values or optional arguments. Here we will define a function with an optional z argument with the default value 1.5: def my_function2(x, y, z=1.5): if z > 1: return z * (x + y) else: return z / (x + y)

While keyword arguments are optional, all positional arguments must be specified when calling a function. You can pass values to the z argument with or without the keyword provided, though using the keyword is encouraged: In [181]: my_function2(5, 6, z=0.7) Out[181]: 0.06363636363636363 In [182]: my_function2(3.14, 7, 3.5) Out[182]: 35.49

66

|

Chapter 3: Built-In Data Structures, Functions, and Files

In [183]: my_function2(10, 20) Out[183]: 45.0

The main restriction on function arguments is that the keyword arguments must follow the positional arguments (if any). You can specify keyword arguments in any order. This frees you from having to remember the order in which the function arguments were specified. You need to remember only what their names are.

Namespaces, Scope, and Local Functions Functions can access variables created inside the function as well as those outside the function in higher (or even global) scopes. An alternative and more descriptive name describing a variable scope in Python is a namespace. Any variables that are assigned within a function by default are assigned to the local namespace. The local namespace is created when the function is called and is immediately populated by the function’s arguments. After the function is finished, the local namespace is destroyed (with some exceptions that are outside the purview of this chapter). Consider the following function: def func(): a = [] for i in range(5): a.append(i)

When func() is called, the empty list a is created, five elements are appended, and then a is destroyed when the function exits. Suppose instead we had declared a as follows: In [184]: a = [] In [185]: def func(): .....: for i in range(5): .....: a.append(i)

Each call to func will modify list a: In [186]: func() In [187]: a Out[187]: [0, 1, 2, 3, 4] In [188]: func() In [189]: a Out[189]: [0, 1, 2, 3, 4, 0, 1, 2, 3, 4]

Assigning variables outside of the function’s scope is possible, but those variables must be declared explicitly using either the global or nonlocal keywords: In [190]: a = None

3.2 Functions

|

67

In [191]: def bind_a_variable(): .....: global a .....: a = [] .....: bind_a_variable() .....: In [192]: print(a) []

nonlocal allows a function to modify variables defined in a higher-level scope that is not global. Since its use is somewhat esoteric (I never use it in this book), I refer you to the Python documentation to learn more about it. I generally discourage use of the global keyword. Typically, global variables are used to store some kind of state in a system. If you find yourself using a lot of them, it may indicate a need for objectoriented programming (using classes).

Returning Multiple Values When I first programmed in Python after having programmed in Java and C++, one of my favorite features was the ability to return multiple values from a function with simple syntax. Here’s an example: def f(): a = 5 b = 6 c = 7 return a, b, c a, b, c = f()

In data analysis and other scientific applications, you may find yourself doing this often. What’s happening here is that the function is actually just returning one object, a tuple, which is then being unpacked into the result variables. In the preceding example, we could have done this instead: return_value = f()

In this case, return_value would be a 3-tuple with the three returned variables. A potentially attractive alternative to returning multiple values like before might be to return a dictionary instead: def f(): a = 5 b = 6 c = 7 return {"a" : a, "b" : b, "c" : c}

This alternative technique can be useful depending on what you are trying to do. 68

|

Chapter 3: Built-In Data Structures, Functions, and Files

Functions Are Objects Since Python functions are objects, many constructs can be easily expressed that are difficult to do in other languages. Suppose we were doing some data cleaning and needed to apply a bunch of transformations to the following list of strings: In [193]: states = [" Alabama ", "Georgia!", "Georgia", "georgia", "FlOrIda", .....: "south carolina##", "West virginia?"]

Anyone who has ever worked with user-submitted survey data has seen messy results like these. Lots of things need to happen to make this list of strings uniform and ready for analysis: stripping whitespace, removing punctuation symbols, and stand‐ ardizing proper capitalization. One way to do this is to use built-in string methods along with the re standard library module for regular expressions: import re def clean_strings(strings): result = [] for value in strings: value = value.strip() value = re.sub("[!#?]", "", value) value = value.title() result.append(value) return result

The result looks like this: In [195]: clean_strings(states) Out[195]: ['Alabama', 'Georgia', 'Georgia', 'Georgia', 'Florida', 'South Carolina', 'West Virginia']

An alternative approach that you may find useful is to make a list of the operations you want to apply to a particular set of strings: def remove_punctuation(value): return re.sub("[!#?]", "", value) clean_ops = [str.strip, remove_punctuation, str.title] def clean_strings(strings, ops): result = [] for value in strings: for func in ops: value = func(value) result.append(value) return result

3.2 Functions

|

69

Then we have the following: In [197]: clean_strings(states, clean_ops) Out[197]: ['Alabama', 'Georgia', 'Georgia', 'Georgia', 'Florida', 'South Carolina', 'West Virginia']

A more functional pattern like this enables you to easily modify how the strings are transformed at a very high level. The clean_strings function is also now more reusable and generic. You can use functions as arguments to other functions like the built-in map function, which applies a function to a sequence of some kind: In [198]: for x in map(remove_punctuation, states): .....: print(x) Alabama Georgia Georgia georgia FlOrIda south carolina West virginia

map can be used as an alternative to list comprehensions without any filter.

Anonymous (Lambda) Functions Python has support for so-called anonymous or lambda functions, which are a way of writing functions consisting of a single statement, the result of which is the return value. They are defined with the lambda keyword, which has no meaning other than “we are declaring an anonymous function”: In [199]: def short_function(x): .....: return x * 2 In [200]: equiv_anon = lambda x: x * 2

I usually refer to these as lambda functions in the rest of the book. They are especially convenient in data analysis because, as you’ll see, there are many cases where data transformation functions will take functions as arguments. It’s often less typing (and clearer) to pass a lambda function as opposed to writing a full-out function declara‐ tion or even assigning the lambda function to a local variable. Consider this example: In [201]: def apply_to_list(some_list, f): .....: return [f(x) for x in some_list]

70

|

Chapter 3: Built-In Data Structures, Functions, and Files

In [202]: ints = [4, 0, 1, 5, 6] In [203]: apply_to_list(ints, lambda x: x * 2) Out[203]: [8, 0, 2, 10, 12]

You could also have written [x * 2 for x in ints], but here we were able to succinctly pass a custom operator to the apply_to_list function. As another example, suppose you wanted to sort a collection of strings by the number of distinct letters in each string: In [204]: strings = ["foo", "card", "bar", "aaaa", "abab"]

Here we could pass a lambda function to the list’s sort method: In [205]: strings.sort(key=lambda x: len(set(x))) In [206]: strings Out[206]: ['aaaa', 'foo', 'abab', 'bar', 'card']

Generators Many objects in Python support iteration, such as over objects in a list or lines in a file. This is accomplished by means of the iterator protocol, a generic way to make objects iterable. For example, iterating over a dictionary yields the dictionary keys: In [207]: some_dict = {"a": 1, "b": 2, "c": 3} In [208]: for key in some_dict: .....: print(key) a b c

When you write for key in some_dict, the Python interpreter first attempts to create an iterator out of some_dict: In [209]: dict_iterator = iter(some_dict) In [210]: dict_iterator Out[210]:

An iterator is any object that will yield objects to the Python interpreter when used in a context like a for loop. Most methods expecting a list or list-like object will also accept any iterable object. This includes built-in methods such as min, max, and sum, and type constructors like list and tuple: In [211]: list(dict_iterator) Out[211]: ['a', 'b', 'c']

3.2 Functions

|

71

A generator is a convenient way, similar to writing a normal function, to construct a new iterable object. Whereas normal functions execute and return a single result at a time, generators can return a sequence of multiple values by pausing and resuming execution each time the generator is used. To create a generator, use the yield keyword instead of return in a function: def squares(n=10): print(f"Generating squares from 1 to {n ** 2}") for i in range(1, n + 1): yield i ** 2

When you actually call the generator, no code is immediately executed: In [213]: gen = squares() In [214]: gen Out[214]:

It is not until you request elements from the generator that it begins executing its code: In [215]: for x in gen: .....: print(x, end=" ") Generating squares from 1 to 100 1 4 9 16 25 36 49 64 81 100

Since generators produce output one element at a time versus an entire list all at once, it can help your program use less memory.

Generator expressions Another way to make a generator is by using a generator expression. This is a genera‐ tor analogue to list, dictionary, and set comprehensions. To create one, enclose what would otherwise be a list comprehension within parentheses instead of brackets: In [216]: gen = (x ** 2 for x in range(100)) In [217]: gen Out[217]:

This is equivalent to the following more verbose generator: def _make_gen(): for x in range(100): yield x ** 2 gen = _make_gen()

Generator expressions can be used instead of list comprehensions as function argu‐ ments in some cases: 72

| Chapter 3: Built-In Data Structures, Functions, and Files

In [218]: sum(x ** 2 for x in range(100)) Out[218]: 328350 In [219]: dict((i, i ** 2) for i in range(5)) Out[219]: {0: 0, 1: 1, 2: 4, 3: 9, 4: 16}

Depending on the number of elements produced by the comprehension expression, the generator version can sometimes be meaningfully faster.

itertools module The standard library itertools module has a collection of generators for many common data algorithms. For example, groupby takes any sequence and a function, grouping consecutive elements in the sequence by return value of the function. Here’s an example: In [220]: import itertools In [221]: def first_letter(x): .....: return x[0] In [222]: names = ["Alan", "Adam", "Wes", "Will", "Albert", "Steven"] In [223]: for letter, names in itertools.groupby(names, first_letter): .....: print(letter, list(names)) # names is a generator A ['Alan', 'Adam'] W ['Wes', 'Will'] A ['Albert'] S ['Steven']

See Table 3-2 for a list of a few other itertools functions I’ve frequently found helpful. You may like to check out the official Python documentation for more on this useful built-in utility module. Table 3-2. Some useful itertools functions Function

Description Generates a sequence by chaining iterators together. Once elements from the first iterator are exhausted, elements from the next iterator are returned, and so on. combinations(iterable, k) Generates a sequence of all possible k-tuples of elements in the iterable, ignoring order and without replacement (see also the companion function combinations_with_replacement). permutations(iterable, k) Generates a sequence of all possible k-tuples of elements in the iterable, respecting order. groupby(iterable[, keyfunc]) Generates (key, sub-iterator) for each unique key. product(*iterables, repeat=1) Generates the Cartesian product of the input iterables as tuples, similar to a nested for loop. chain(*iterables)

3.2 Functions

|

73

Errors and Exception Handling Handling Python errors or exceptions gracefully is an important part of building robust programs. In data analysis applications, many functions work only on certain kinds of input. As an example, Python’s float function is capable of casting a string to a floating-point number, but it fails with ValueError on improper inputs: In [224]: float("1.2345") Out[224]: 1.2345 In [225]: float("something") --------------------------------------------------------------------------ValueError Traceback (most recent call last) in ----> 1 float("something") ValueError: could not convert string to float: 'something'

Suppose we wanted a version of float that fails gracefully, returning the input argument. We can do this by writing a function that encloses the call to float in a try/except block (execute this code in IPython): def attempt_float(x): try: return float(x) except: return x

The code in the except part of the block will only be executed if float(x) raises an exception: In [227]: attempt_float("1.2345") Out[227]: 1.2345 In [228]: attempt_float("something") Out[228]: 'something'

You might notice that float can raise exceptions other than ValueError: In [229]: float((1, 2)) --------------------------------------------------------------------------TypeError Traceback (most recent call last) in ----> 1 float((1, 2)) TypeError: float() argument must be a string or a real number, not 'tuple'

You might want to suppress only ValueError, since a TypeError (the input was not a string or numeric value) might indicate a legitimate bug in your program. To do that, write the exception type after except: def attempt_float(x): try: return float(x)

74

|

Chapter 3: Built-In Data Structures, Functions, and Files

except ValueError: return x

We have then: In [231]: attempt_float((1, 2)) --------------------------------------------------------------------------TypeError Traceback (most recent call last) in ----> 1 attempt_float((1, 2)) in attempt_float(x) 1 def attempt_float(x): 2 try: ----> 3 return float(x) 4 except ValueError: 5 return x TypeError: float() argument must be a string or a real number, not 'tuple'

You can catch multiple exception types by writing a tuple of exception types instead (the parentheses are required): def attempt_float(x): try: return float(x) except (TypeError, ValueError): return x

In some cases, you may not want to suppress an exception, but you want some code to be executed regardless of whether or not the code in the try block succeeds. To do this, use finally: f = open(path, mode="w") try: write_to_file(f) finally: f.close()

Here, the file object f will always get closed. Similarly, you can have code that executes only if the try: block succeeds using else: f = open(path, mode="w") try: write_to_file(f) except: print("Failed") else: print("Succeeded") finally: f.close()

3.2 Functions

|

75

Exceptions in IPython If an exception is raised while you are %run-ing a script or executing any statement, IPython will by default print a full call stack trace (traceback) with a few lines of context around the position at each point in the stack: In [10]: %run examples/ipython_bug.py --------------------------------------------------------------------------AssertionError Traceback (most recent call last) /home/wesm/code/pydata-book/examples/ipython_bug.py in () 13 throws_an_exception() 14 ---> 15 calling_things() /home/wesm/code/pydata-book/examples/ipython_bug.py in calling_things() 11 def calling_things(): 12 works_fine() ---> 13 throws_an_exception() 14 15 calling_things() /home/wesm/code/pydata-book/examples/ipython_bug.py in throws_an_exception() 7 a = 5 8 b = 6 ----> 9 assert(a + b == 10) 10 11 def calling_things(): AssertionError:

Having additional context by itself is a big advantage over the standard Python interpreter (which does not provide any additional context). You can control the amount of context shown using the %xmode magic command, from Plain (same as the standard Python interpreter) to Verbose (which inlines function argument values and more). As you will see later in Appendix B, you can step into the stack (using the %debug or %pdb magics) after an error has occurred for interactive postmortem debugging.

3.3 Files and the Operating System Most of this book uses high-level tools like pandas.read_csv to read data files from disk into Python data structures. However, it’s important to understand the basics of how to work with files in Python. Fortunately, it’s relatively straightforward, which is one reason Python is so popular for text and file munging. To open a file for reading or writing, use the built-in open function with either a relative or absolute file path and an optional file encoding:

76

| Chapter 3: Built-In Data Structures, Functions, and Files

In [233]: path = "examples/segismundo.txt" In [234]: f = open(path, encoding="utf-8")

Here, I pass encoding="utf-8" as a best practice because the default Unicode encod‐ ing for reading files varies from platform to platform. By default, the file is opened in read-only mode "r". We can then treat the file object f like a list and iterate over the lines like so: for line in f: print(line)

The lines come out of the file with the end-of-line (EOL) markers intact, so you’ll often see code to get an EOL-free list of lines in a file like: In [235]: lines = [x.rstrip() for x in open(path, encoding="utf-8")] In [236]: lines Out[236]: ['Sueña el rico en su riqueza,', 'que más cuidados le ofrece;', '', 'sueña el pobre que padece', 'su miseria y su pobreza;', '', 'sueña el que a medrar empieza,', 'sueña el que afana y pretende,', 'sueña el que agravia y ofende,', '', 'y en el mundo, en conclusión,', 'todos sueñan lo que son,', 'aunque ninguno lo entiende.', '']

When you use open to create file objects, it is recommended to close the file when you are finished with it. Closing the file releases its resources back to the operating system: In [237]: f.close()

One of the ways to make it easier to clean up open files is to use the with statement: In [238]: with open(path, encoding="utf-8") as f: .....: lines = [x.rstrip() for x in f]

This will automatically close the file f when exiting the with block. Failing to ensure that files are closed will not cause problems in many small programs or scripts, but it can be an issue in programs that need to interact with a large number of files. If we had typed f = open(path, "w"), a new file at examples/segismundo.txt would have been created (be careful!), overwriting any file in its place. There is also the

3.3 Files and the Operating System

|

77

"x" file mode, which creates a writable file but fails if the file path already exists. See Table 3-3 for a list of all valid file read/write modes.

Table 3-3. Python file modes Mode Description Read-only mode r Write-only mode; creates a new file (erasing the data for any file with the same name) w Write-only mode; creates a new file but fails if the file path already exists x Append to existing file (creates the file if it does not already exist) a Read and write r+ b Add to mode for binary files (i.e., "rb" or "wb") Text mode for files (automatically decoding bytes to Unicode); this is the default if not specified t

For readable files, some of the most commonly used methods are read, seek, and tell. read returns a certain number of characters from the file. What constitutes a “character” is determined by the file encoding or simply raw bytes if the file is opened in binary mode: In [239]: f1 = open(path) In [240]: f1.read(10) Out[240]: 'Sueña el r' In [241]: f2 = open(path, mode="rb")

# Binary mode

In [242]: f2.read(10) Out[242]: b'Sue\xc3\xb1a el '

The read method advances the file object position by the number of bytes read. tell gives you the current position: In [243]: f1.tell() Out[243]: 11 In [244]: f2.tell() Out[244]: 10

Even though we read 10 characters from the file f1 opened in text mode, the position is 11 because it took that many bytes to decode 10 characters using the default encoding. You can check the default encoding in the sys module: In [245]: import sys In [246]: sys.getdefaultencoding() Out[246]: 'utf-8'

To get consistent behavior across platforms, it is best to pass an encoding (such as encoding="utf-8", which is widely used) when opening files. 78

|

Chapter 3: Built-In Data Structures, Functions, and Files

seek changes the file position to the indicated byte in the file: In [247]: f1.seek(3) Out[247]: 3 In [248]: f1.read(1) Out[248]: 'ñ' In [249]: f1.tell() Out[249]: 5

Lastly, we remember to close the files: In [250]: f1.close() In [251]: f2.close()

To write text to a file, you can use the file’s write or writelines methods. For example, we could create a version of examples/segismundo.txt with no blank lines like so: In [252]: path Out[252]: 'examples/segismundo.txt' In [253]: with open("tmp.txt", mode="w") as handle: .....: handle.writelines(x for x in open(path) if len(x) > 1) In [254]: with open("tmp.txt") as f: .....: lines = f.readlines() In [255]: lines Out[255]: ['Sueña el rico en su riqueza,\n', 'que más cuidados le ofrece;\n', 'sueña el pobre que padece\n', 'su miseria y su pobreza;\n', 'sueña el que a medrar empieza,\n', 'sueña el que afana y pretende,\n', 'sueña el que agravia y ofende,\n', 'y en el mundo, en conclusión,\n', 'todos sueñan lo que son,\n', 'aunque ninguno lo entiende.\n']

See Table 3-4 for many of the most commonly used file methods. Table 3-4. Important Python file methods or attributes Method/attribute read([size]) readable() readlines([size])

Description Return data from file as bytes or string depending on the file mode, with optional size argument indicating the number of bytes or string characters to read Return True if the file supports read operations Return list of lines in the file, with optional size argument

3.3 Files and the Operating System

|

79

Method/attribute

Description Write passed string to file write(string) writable() Return True if the file supports write operations writelines(strings) Write passed sequence of strings to the file Close the file object close() Flush the internal I/O buffer to disk flush() Move to indicated file position (integer) seek(pos) seekable() Return True if the file object supports seeking and thus random access (some file-like objects do not) Return current file position as integer tell() closed True if the file is closed The encoding used to interpret bytes in the file as Unicode (typically UTF-8) encoding

Bytes and Unicode with Files The default behavior for Python files (whether readable or writable) is text mode, which means that you intend to work with Python strings (i.e., Unicode). This contrasts with binary mode, which you can obtain by appending b to the file mode. Revisiting the file (which contains non-ASCII characters with UTF-8 encoding) from the previous section, we have: In [258]: with open(path) as f: .....: chars = f.read(10) In [259]: chars Out[259]: 'Sueña el r' In [260]: len(chars) Out[260]: 10

UTF-8 is a variable-length Unicode encoding, so when I request some number of characters from the file, Python reads enough bytes (which could be as few as 10 or as many as 40 bytes) from the file to decode that many characters. If I open the file in "rb" mode instead, read requests that exact number of bytes: In [261]: with open(path, mode="rb") as f: .....: data = f.read(10) In [262]: data Out[262]: b'Sue\xc3\xb1a el '

Depending on the text encoding, you may be able to decode the bytes to a str object yourself, but only if each of the encoded Unicode characters is fully formed: In [263]: data.decode("utf-8") Out[263]: 'Sueña el ' In [264]: data[:4].decode("utf-8")

80

|

Chapter 3: Built-In Data Structures, Functions, and Files

--------------------------------------------------------------------------UnicodeDecodeError Traceback (most recent call last) in ----> 1 data[:4].decode("utf-8") UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 3: unexpecte d end of data

Text mode, combined with the encoding option of open, provides a convenient way to convert from one Unicode encoding to another: In [265]: sink_path = "sink.txt" In [266]: with open(path) as source: .....: with open(sink_path, "x", encoding="iso-8859-1") as sink: .....: sink.write(source.read()) In [267]: with open(sink_path, encoding="iso-8859-1") as f: .....: print(f.read(10)) Sueña el r

Beware using seek when opening files in any mode other than binary. If the file position falls in the middle of the bytes defining a Unicode character, then subsequent reads will result in an error: In [269]: f = open(path, encoding='utf-8') In [270]: f.read(5) Out[270]: 'Sueña' In [271]: f.seek(4) Out[271]: 4 In [272]: f.read(1) --------------------------------------------------------------------------UnicodeDecodeError Traceback (most recent call last) in ----> 1 f.read(1) /miniconda/envs/book-env/lib/python3.10/codecs.py in decode(self, input, final) 320 # decode input (taking the buffer into account) 321 data = self.buffer + input --> 322 (result, consumed) = self._buffer_decode(data, self.errors, final ) 323 # keep undecoded input until the next call 324 self.buffer = data[consumed:] UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb1 in position 0: invalid s tart byte In [273]: f.close()

If you find yourself regularly doing data analysis on non-ASCII text data, mastering Python’s Unicode functionality will prove valuable. See Python’s online documenta‐ tion for much more.

3.3 Files and the Operating System

|

81

3.4 Conclusion With some of the basics of the Python environment and language now under your belt, it is time to move on and learn about NumPy and array-oriented computing in Python.

82

|

Chapter 3: Built-In Data Structures, Functions, and Files

CHAPTER 4

NumPy Basics: Arrays and Vectorized Computation

NumPy, short for Numerical Python, is one of the most important foundational pack‐ ages for numerical computing in Python. Many computational packages providing scientific functionality use NumPy’s array objects as one of the standard interface lingua francas for data exchange. Much of the knowledge about NumPy that I cover is transferable to pandas as well. Here are some of the things you’ll find in NumPy: • ndarray, an efficient multidimensional array providing fast array-oriented arith‐ metic operations and flexible broadcasting capabilities • Mathematical functions for fast operations on entire arrays of data without hav‐ ing to write loops • Tools for reading/writing array data to disk and working with memory-mapped files • Linear algebra, random number generation, and Fourier transform capabilities • A C API for connecting NumPy with libraries written in C, C++, or FORTRAN Because NumPy provides a comprehensive and well-documented C API, it is straightforward to pass data to external libraries written in a low-level language, and for external libraries to return data to Python as NumPy arrays. This feature has made Python a language of choice for wrapping legacy C, C++, or FORTRAN codebases and giving them a dynamic and accessible interface. While NumPy by itself does not provide modeling or scientific functionality, having an understanding of NumPy arrays and array-oriented computing will help you use tools with array computing semantics, like pandas, much more effectively. Since 83

NumPy is a large topic, I will cover many advanced NumPy features like broadcasting in more depth later (see Appendix A). Many of these advanced features are not needed to follow the rest of this book, but they may help you as you go deeper into scientific computing in Python. For most data analysis applications, the main areas of functionality I’ll focus on are: • Fast array-based operations for data munging and cleaning, subsetting and filter‐ ing, transformation, and any other kind of computation • Common array algorithms like sorting, unique, and set operations • Efficient descriptive statistics and aggregating/summarizing data • Data alignment and relational data manipulations for merging and joining heter‐ ogeneous datasets • Expressing conditional logic as array expressions instead of loops with if-elifelse branches • Group-wise data manipulations (aggregation, transformation, and function application) While NumPy provides a computational foundation for general numerical data processing, many readers will want to use pandas as the basis for most kinds of statistics or analytics, especially on tabular data. Also, pandas provides some more domain-specific functionality like time series manipulation, which is not present in NumPy. Array-oriented computing in Python traces its roots back to 1995, when Jim Hugunin created the Numeric library. Over the next 10 years, many scientific programming communities began doing array programming in Python, but the library ecosystem had become fragmented in the early 2000s. In 2005, Travis Oliphant was able to forge the NumPy project from the then Numeric and Numarray projects to bring the community together around a sin‐ gle array computing framework.

One of the reasons NumPy is so important for numerical computations in Python is because it is designed for efficiency on large arrays of data. There are a number of reasons for this: • NumPy internally stores data in a contiguous block of memory, independent of other built-in Python objects. NumPy’s library of algorithms written in the C lan‐ guage can operate on this memory without any type checking or other overhead. NumPy arrays also use much less memory than built-in Python sequences.

84

|

Chapter 4: NumPy Basics: Arrays and Vectorized Computation

• NumPy operations perform complex computations on entire arrays without the need for Python for loops, which can be slow for large sequences. NumPy is faster than regular Python code because its C-based algorithms avoid overhead present with regular interpreted Python code. To give you an idea of the performance difference, consider a NumPy array of one million integers, and the equivalent Python list: In [7]: import numpy as np In [8]: my_arr = np.arange(1_000_000) In [9]: my_list = list(range(1_000_000))

Now let’s multiply each sequence by 2: In [10]: %timeit my_arr2 = my_arr * 2 715 us +- 13.2 us per loop (mean +- std. dev. of 7 runs, 1000 loops each) In [11]: %timeit my_list2 = [x * 2 for x in my_list] 48.8 ms +- 298 us per loop (mean +- std. dev. of 7 runs, 10 loops each)

NumPy-based algorithms are generally 10 to 100 times faster (or more) than their pure Python counterparts and use significantly less memory.

4.1 The NumPy ndarray: A Multidimensional Array Object One of the key features of NumPy is its N-dimensional array object, or ndarray, which is a fast, flexible container for large datasets in Python. Arrays enable you to perform mathematical operations on whole blocks of data using similar syntax to the equivalent operations between scalar elements. To give you a flavor of how NumPy enables batch computations with similar syntax to scalar values on built-in Python objects, I first import NumPy and create a small array: In [12]: import numpy as np In [13]: data = np.array([[1.5, -0.1, 3], [0, -3, 6.5]]) In [14]: data Out[14]: array([[ 1.5, -0.1, [ 0. , -3. ,

3. ], 6.5]])

I then write mathematical operations with data: In [15]: data * 10 Out[15]: array([[ 15., -1., 30.], [ 0., -30., 65.]])

4.1 The NumPy ndarray: A Multidimensional Array Object

|

85

In [16]: data + data Out[16]: array([[ 3. , -0.2, 6. ], [ 0. , -6. , 13. ]])

In the first example, all of the elements have been multiplied by 10. In the second, the corresponding values in each “cell” in the array have been added to each other. In this chapter and throughout the book, I use the standard NumPy convention of always using import numpy as np. It would be possible to put from numpy import * in your code to avoid having to write np., but I advise against making a habit of this. The numpy namespace is large and contains a number of functions whose names conflict with built-in Python functions (like min and max). Following standard conventions like these is almost always a good idea.

An ndarray is a generic multidimensional container for homogeneous data; that is, all of the elements must be the same type. Every array has a shape, a tuple indicating the size of each dimension, and a dtype, an object describing the data type of the array: In [17]: data.shape Out[17]: (2, 3) In [18]: data.dtype Out[18]: dtype('float64')

This chapter will introduce you to the basics of using NumPy arrays, and it should be sufficient for following along with the rest of the book. While it’s not necessary to have a deep understanding of NumPy for many data analytical applications, becom‐ ing proficient in array-oriented programming and thinking is a key step along the way to becoming a scientific Python guru. Whenever you see “array,” “NumPy array,” or “ndarray” in the book text, in most cases they all refer to the ndarray object.

Creating ndarrays The easiest way to create an array is to use the array function. This accepts any sequence-like object (including other arrays) and produces a new NumPy array containing the passed data. For example, a list is a good candidate for conversion: In [19]: data1 = [6, 7.5, 8, 0, 1]

86

|

Chapter 4: NumPy Basics: Arrays and Vectorized Computation

In [20]: arr1 = np.array(data1) In [21]: arr1 Out[21]: array([6. , 7.5, 8. , 0. , 1. ])

Nested sequences, like a list of equal-length lists, will be converted into a multidimen‐ sional array: In [22]: data2 = [[1, 2, 3, 4], [5, 6, 7, 8]] In [23]: arr2 = np.array(data2) In [24]: arr2 Out[24]: array([[1, 2, 3, 4], [5, 6, 7, 8]])

Since data2 was a list of lists, the NumPy array arr2 has two dimensions, with shape inferred from the data. We can confirm this by inspecting the ndim and shape attributes: In [25]: arr2.ndim Out[25]: 2 In [26]: arr2.shape Out[26]: (2, 4)

Unless explicitly specified (discussed in “Data Types for ndarrays” on page 88), numpy.array tries to infer a good data type for the array that it creates. The data type is stored in a special dtype metadata object; for example, in the previous two examples we have: In [27]: arr1.dtype Out[27]: dtype('float64') In [28]: arr2.dtype Out[28]: dtype('int64')

In addition to numpy.array, there are a number of other functions for creating new arrays. As examples, numpy.zeros and numpy.ones create arrays of 0s or 1s, respectively, with a given length or shape. numpy.empty creates an array without initializing its values to any particular value. To create a higher dimensional array with these methods, pass a tuple for the shape: In [29]: np.zeros(10) Out[29]: array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]) In [30]: np.zeros((3, 6)) Out[30]: array([[0., 0., 0., 0., 0., 0.], [0., 0., 0., 0., 0., 0.], [0., 0., 0., 0., 0., 0.]])

4.1 The NumPy ndarray: A Multidimensional Array Object

|

87

In [31]: np.empty((2, 3, 2)) Out[31]: array([[[0., 0.], [0., 0.], [0., 0.]], [[0., 0.], [0., 0.], [0., 0.]]])

It’s not safe to assume that numpy.empty will return an array of all zeros. This function returns uninitialized memory and thus may contain nonzero “garbage” values. You should use this function only if you intend to populate the new array with data.

numpy.arange is an array-valued version of the built-in Python range function: In [32]: np.arange(15) Out[32]: array([ 0, 1,

2, 3,

4,

5,

6,

7,

8,

9, 10, 11, 12, 13, 14])

See Table 4-1 for a short list of standard array creation functions. Since NumPy is focused on numerical computing, the data type, if not specified, will in many cases be float64 (floating point). Table 4-1. Some important NumPy array creation functions Function array asarray arange ones, ones_like zeros, zeros_like empty, empty_like

Description Convert input data (list, tuple, array, or other sequence type) to an ndarray either by inferring a data type or explicitly specifying a data type; copies the input data by default Convert input to ndarray, but do not copy if the input is already an ndarray Like the built-in range but returns an ndarray instead of a list Produce an array of all 1s with the given shape and data type; ones_like takes another array and produces a ones array of the same shape and data type Like ones and ones_like but producing arrays of 0s instead Create new arrays by allocating new memory, but do not populate with any values like ones and zeros

Produce an array of the given shape and data type with all values set to the indicated “fill value”;

full, full_like

full_like takes another array and produces a filled array of the same shape and data type

eye, identity

Create a square N × N identity matrix (1s on the diagonal and 0s elsewhere)

Data Types for ndarrays The data type or dtype is a special object containing the information (or metadata, data about data) the ndarray needs to interpret a chunk of memory as a particular type of data:

88

|

Chapter 4: NumPy Basics: Arrays and Vectorized Computation

In [33]: arr1 = np.array([1, 2, 3], dtype=np.float64) In [34]: arr2 = np.array([1, 2, 3], dtype=np.int32) In [35]: arr1.dtype Out[35]: dtype('float64') In [36]: arr2.dtype Out[36]: dtype('int32')

Data types are a source of NumPy’s flexibility for interacting with data coming from other systems. In most cases they provide a mapping directly onto an underlying disk or memory representation, which makes it possible to read and write binary streams of data to disk and to connect to code written in a low-level language like C or FORTRAN. The numerical data types are named the same way: a type name, like float or int, followed by a number indicating the number of bits per element. A standard double-precision floating-point value (what’s used under the hood in Python’s float object) takes up 8 bytes or 64 bits. Thus, this type is known in NumPy as float64. See Table 4-2 for a full listing of NumPy’s supported data types. Don’t worry about memorizing the NumPy data types, especially if you’re a new user. It’s often only necessary to care about the general kind of data you’re dealing with, whether floating point, complex, integer, Boolean, string, or general Python object. When you need more control over how data is stored in memory and on disk, especially large datasets, it is good to know that you have control over the storage type.

Table 4-2. NumPy data types Type

Type code

int8, uint8

i1, u1

int16, uint16

i2, u2

int32, uint32

i4, u4

int64, uint64

i8, u8

float16

f2

float32

f4 or f

float64

f8 or d

float128

f16 or g

complex64, complex128, complex256

c8, c16, c32

bool

? O

object

Description Signed and unsigned 8-bit (1 byte) integer types Signed and unsigned 16-bit integer types Signed and unsigned 32-bit integer types Signed and unsigned 64-bit integer types Half-precision floating point Standard single-precision floating point; compatible with C float Standard double-precision floating point; compatible with C double and Python float object Extended-precision floating point Complex numbers represented by two 32, 64, or 128 floats, respectively

Boolean type storing True and False values Python object type; a value can be any Python object

4.1 The NumPy ndarray: A Multidimensional Array Object

|

89

Type string_

Type code S

unicode_

U

Description Fixed-length ASCII string type (1 byte per character); for example, to create a string data type with length 10, use 'S10' Fixed-length Unicode type (number of bytes platform specific); same specification semantics as string_ (e.g., 'U10')

There are both signed and unsigned integer types, and many readers will not be familiar with this terminology. A signed integer can represent both positive and negative integers, while an unsigned integer can only represent nonzero integers. For example, int8 (signed 8-bit integer) can represent integers from -128 to 127 (inclusive), while uint8 (unsigned 8-bit integer) can represent 0 through 255.

You can explicitly convert or cast an array from one data type to another using ndarray’s astype method: In [37]: arr = np.array([1, 2, 3, 4, 5]) In [38]: arr.dtype Out[38]: dtype('int64') In [39]: float_arr = arr.astype(np.float64) In [40]: float_arr Out[40]: array([1., 2., 3., 4., 5.]) In [41]: float_arr.dtype Out[41]: dtype('float64')

In this example, integers were cast to floating point. If I cast some floating-point numbers to be of integer data type, the decimal part will be truncated: In [42]: arr = np.array([3.7, -1.2, -2.6, 0.5, 12.9, 10.1]) In [43]: arr Out[43]: array([ 3.7, -1.2, -2.6,

0.5, 12.9, 10.1])

In [44]: arr.astype(np.int32) Out[44]: array([ 3, -1, -2, 0, 12, 10], dtype=int32)

If you have an array of strings representing numbers, you can use astype to convert them to numeric form: In [45]: numeric_strings = np.array(["1.25", "-9.6", "42"], dtype=np.string_) In [46]: numeric_strings.astype(float) Out[46]: array([ 1.25, -9.6 , 42. ])

90

|

Chapter 4: NumPy Basics: Arrays and Vectorized Computation

Be cautious when using the numpy.string_ type, as string data in NumPy is fixed size and may truncate input without warning. pan‐ das has more intuitive out-of-the-box behavior on non-numeric data.

If casting were to fail for some reason (like a string that cannot be converted to float64), a ValueError will be raised. Before, I was a bit lazy and wrote float instead of np.float64; NumPy aliases the Python types to its own equivalent data types. You can also use another array’s dtype attribute: In [47]: int_array = np.arange(10) In [48]: calibers = np.array([.22, .270, .357, .380, .44, .50], dtype=np.float64) In [49]: int_array.astype(calibers.dtype) Out[49]: array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])

There are shorthand type code strings you can also use to refer to a dtype: In [50]: zeros_uint32 = np.zeros(8, dtype="u4") In [51]: zeros_uint32 Out[51]: array([0, 0, 0, 0, 0, 0, 0, 0], dtype=uint32)

Calling astype always creates a new array (a copy of the data), even if the new data type is the same as the old data type.

Arithmetic with NumPy Arrays Arrays are important because they enable you to express batch operations on data without writing any for loops. NumPy users call this vectorization. Any arithmetic operations between equal-size arrays apply the operation element-wise: In [52]: arr = np.array([[1., 2., 3.], [4., 5., 6.]]) In [53]: arr Out[53]: array([[1., 2., 3.], [4., 5., 6.]]) In [54]: arr * arr Out[54]: array([[ 1., 4., 9.], [16., 25., 36.]])

4.1 The NumPy ndarray: A Multidimensional Array Object

|

91

In [55]: arr - arr Out[55]: array([[0., 0., 0.], [0., 0., 0.]])

Arithmetic operations with scalars propagate the scalar argument to each element in the array: In [56]: 1 / arr Out[56]: array([[1. , 0.5 [0.25 , 0.2

, 0.3333], , 0.1667]])

In [57]: arr ** 2 Out[57]: array([[ 1., 4., 9.], [16., 25., 36.]])

Comparisons between arrays of the same size yield Boolean arrays: In [58]: arr2 = np.array([[0., 4., 1.], [7., 2., 12.]]) In [59]: arr2 Out[59]: array([[ 0., 4., 1.], [ 7., 2., 12.]]) In [60]: arr2 > arr Out[60]: array([[False, True, False], [ True, False, True]])

Evaluating operations between differently sized arrays is called broadcasting and will be discussed in more detail in Appendix A. Having a deep understanding of broadcasting is not necessary for most of this book.

Basic Indexing and Slicing NumPy array indexing is a deep topic, as there are many ways you may want to select a subset of your data or individual elements. One-dimensional arrays are simple; on the surface they act similarly to Python lists: In [61]: arr = np.arange(10) In [62]: arr Out[62]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9]) In [63]: arr[5] Out[63]: 5 In [64]: arr[5:8] Out[64]: array([5, 6, 7])

92

| Chapter 4: NumPy Basics: Arrays and Vectorized Computation

In [65]: arr[5:8] = 12 In [66]: arr Out[66]: array([ 0,

1,

2, 3,

4, 12, 12, 12,

8,

9])

As you can see, if you assign a scalar value to a slice, as in arr[5:8] = 12, the value is propagated (or broadcast henceforth) to the entire selection. An important first distinction from Python’s built-in lists is that array slices are views on the original array. This means that the data is not copied, and any modifications to the view will be reflected in the source array.

To give an example of this, I first create a slice of arr: In [67]: arr_slice = arr[5:8] In [68]: arr_slice Out[68]: array([12, 12, 12])

Now, when I change values in arr_slice, the mutations are reflected in the original array arr: In [69]: arr_slice[1] = 12345 In [70]: arr Out[70]: array([ 0, 9])

1,

2,

3,

4,

12, 12345,

12,

8,

The “bare” slice [:] will assign to all values in an array: In [71]: arr_slice[:] = 64 In [72]: arr Out[72]: array([ 0,

1,

2,

3,

4, 64, 64, 64,

8,

9])

If you are new to NumPy, you might be surprised by this, especially if you have used other array programming languages that copy data more eagerly. As NumPy has been designed to be able to work with very large arrays, you could imagine performance and memory problems if NumPy insisted on always copying data. If you want a copy of a slice of an ndarray instead of a view, you will need to explicitly copy the array—for example, arr[5:8].copy(). As you will see, pandas works this way, too.

4.1 The NumPy ndarray: A Multidimensional Array Object

|

93

With higher dimensional arrays, you have many more options. In a two-dimensional array, the elements at each index are no longer scalars but rather one-dimensional arrays: In [73]: arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) In [74]: arr2d[2] Out[74]: array([7, 8, 9])

Thus, individual elements can be accessed recursively. But that is a bit too much work, so you can pass a comma-separated list of indices to select individual elements. So these are equivalent: In [75]: arr2d[0][2] Out[75]: 3 In [76]: arr2d[0, 2] Out[76]: 3

See Figure 4-1 for an illustration of indexing on a two-dimensional array. I find it helpful to think of axis 0 as the “rows” of the array and axis 1 as the “columns.”

Figure 4-1. Indexing elements in a NumPy array In multidimensional arrays, if you omit later indices, the returned object will be a lower dimensional ndarray consisting of all the data along the higher dimensions. So in the 2 × 2 × 3 array arr3d: In [77]: arr3d = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]]) In [78]: arr3d Out[78]: array([[[ 1, 2, 3], [ 4, 5, 6]], [[ 7, 8, 9], [10, 11, 12]]])

arr3d[0] is a 2 × 3 array: 94

|

Chapter 4: NumPy Basics: Arrays and Vectorized Computation

In [79]: arr3d[0] Out[79]: array([[1, 2, 3], [4, 5, 6]])

Both scalar values and arrays can be assigned to arr3d[0]: In [80]: old_values = arr3d[0].copy() In [81]: arr3d[0] = 42 In [82]: arr3d Out[82]: array([[[42, 42, 42], [42, 42, 42]], [[ 7, 8, 9], [10, 11, 12]]]) In [83]: arr3d[0] = old_values In [84]: arr3d Out[84]: array([[[ 1, 2, 3], [ 4, 5, 6]], [[ 7, 8, 9], [10, 11, 12]]])

Similarly, arr3d[1, 0] gives you all of the values whose indices start with (1, 0), forming a one-dimensional array: In [85]: arr3d[1, 0] Out[85]: array([7, 8, 9])

This expression is the same as though we had indexed in two steps: In [86]: x = arr3d[1] In [87]: x Out[87]: array([[ 7, 8, 9], [10, 11, 12]]) In [88]: x[0] Out[88]: array([7, 8, 9])

Note that in all of these cases where subsections of the array have been selected, the returned arrays are views. This multidimensional indexing syntax for NumPy arrays will not work with regular Python objects, such as lists of lists.

4.1 The NumPy ndarray: A Multidimensional Array Object

|

95

Indexing with slices Like one-dimensional objects such as Python lists, ndarrays can be sliced with the familiar syntax: In [89]: arr Out[89]: array([ 0,

1,

2,

3,

In [90]: arr[1:6] Out[90]: array([ 1,

2,

3,

4, 64])

4, 64, 64, 64,

8,

9])

Consider the two-dimensional array from before, arr2d. Slicing this array is a bit different: In [91]: arr2d Out[91]: array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) In [92]: arr2d[:2] Out[92]: array([[1, 2, 3], [4, 5, 6]])

As you can see, it has sliced along axis 0, the first axis. A slice, therefore, selects a range of elements along an axis. It can be helpful to read the expression arr2d[:2] as “select the first two rows of arr2d.” You can pass multiple slices just like you can pass multiple indexes: In [93]: arr2d[:2, 1:] Out[93]: array([[2, 3], [5, 6]])

When slicing like this, you always obtain array views of the same number of dimen‐ sions. By mixing integer indexes and slices, you get lower dimensional slices. For example, I can select the second row but only the first two columns, like so: In [94]: lower_dim_slice = arr2d[1, :2]

Here, while arr2d is two-dimensional, lower_dim_slice is one-dimensional, and its shape is a tuple with one axis size: In [95]: lower_dim_slice.shape Out[95]: (2,)

Similarly, I can select the third column but only the first two rows, like so: In [96]: arr2d[:2, 2] Out[96]: array([3, 6])

96

|

Chapter 4: NumPy Basics: Arrays and Vectorized Computation

See Figure 4-2 for an illustration. Note that a colon by itself means to take the entire axis, so you can slice only higher dimensional axes by doing: In [97]: arr2d[:, :1] Out[97]: array([[1], [4], [7]])

Of course, assigning to a slice expression assigns to the whole selection: In [98]: arr2d[:2, 1:] = 0 In [99]: arr2d Out[99]: array([[1, 0, 0], [4, 0, 0], [7, 8, 9]])

Figure 4-2. Two-dimensional array slicing

Boolean Indexing Let’s consider an example where we have some data in an array and an array of names with duplicates:

4.1 The NumPy ndarray: A Multidimensional Array Object

|

97

In [100]: names = np.array(["Bob", "Joe", "Will", "Bob", "Will", "Joe", "Joe"]) In [101]: data = np.array([[4, 7], [0, 2], [-5, 6], [0, 0], [1, 2], .....: [-12, -4], [3, 4]]) In [102]: names Out[102]: array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'], dtype='