Software Engineering for Data Scientists: From Notebooks to Scalable Systems
9781098136208
Data science happens in code. The ability to write reproducible, robust, scalable code is key to a data science project
English
Pages 400
Year 2024
Table of contents:
Preface
Who Is This Book For?
Why Python?
What Is Not in This Book
Guide to This Book
Reading Order
Conventions Used in This Book
Using Code Examples
O’Reilly Online Learning
How to Contact Us
Acknowledgments
1. What Is Good Code?
Why Good Code Matters
Adapting to Changing Requirements
Simplicity
Don’t Repeat Yourself (DRY)
Avoid Verbose Code
Modularity
Readability
Standards and Conventions
Names
Cleaning Up
Documentation
Performance
Robustness
Errors and Logging
Testing
Key Takeaways
2. Analyzing Code Performance
Methods to Improve Performance
Timing Your Code
Profiling Your Code
cProfile
line_profiler
Memory Profiling with Memray
Time Complexity
How to Estimate Time Complexity
Big O Notation
Key Takeaways
3. Using Data Structures Effectively
Native Python Data Structures
Lists
Tuples
Dictionaries
Sets
NumPy Arrays
NumPy Array Functionality
NumPy Array Performance Considerations
Array Operations Using Dask
Arrays in Machine Learning
pandas DataFrames
DataFrame Functionality
DataFrame Performance Considerations
Key Takeaways
4. Object-Oriented Programming and Functional Programming
Object-Oriented Programming
Classes, Methods, and Attributes
Defining Your Own Classes
OOP Principles
Functional Programming
Lambda Functions and map()
Applying Functions to DataFrames
Which Paradigm Should I Use?
Key Takeaways
5. Errors, Logging, and Debugging
Errors in Python
Reading Python Error Messages
Handling Errors
Raising Errors
Logging
What to Log
Logging Configuration
How to Log
Debugging
Strategies for Debugging
Tools for Debugging
Key Takeaways
6. Code Formatting, Linting, and Type Checking
Code Formatting and Style Guides
PEP8
Import Formatting
Automatic Code Formatting with Black
Linting
Linting Tools
Linting in Your IDE
Type Checking
Type Annotations
Type Checking with mypy
Key Takeaways
7. Testing Your Code
Why You Should Write Tests
When to Test
How to Write and Run Tests
A Basic Test
Testing Unexpected Inputs
Running Automated Tests with Pytest
Types of Tests
Unit Tests
Integration Tests
Data Validation
Data Validation Examples
Using Pandera for Data Validation
Data Validation with Pydantic
Testing for Machine Learning
Testing Model Training
Testing Model Inference
Key Takeaways
8. Design and Refactoring
Project Design and Structure
Project Design Considerations
An Example Machine Learning Project
Code Design
Modular Code
A Code Design Framework
Interfaces and Contracts
Coupling
From Notebooks to Scalable Scripts
Why Use Scripts Instead of Notebooks?
Creating Scripts from Notebooks
Refactoring
Strategies for Refactoring
An Example Refactoring Workflow
Key Takeaways
9. Documentation
Documentation Within the Codebase
Names
Comments
Docstrings
Readmes, Tutorials, and Other Longer Documents
Documentation in Jupyter Notebooks
Documenting Machine Learning Experiments
Key Takeaways
10. Sharing Your Code: Version Control, Dependencies, and Packaging
Version Control Using Git
How Does Git Work?
Tracking Changes and Committing
Remote and Local
Branches and Pull Requests
Dependencies and Virtual Environments
Virtual Environments
Managing Dependencies with pip
Managing Dependencies with Poetry
Python Packaging
Packaging Basics
pyproject.toml
Building and Uploading Packages
Key Takeaways
11. APIs
Calling an API
HTTP Methods and Status Codes
Getting Data from the SDG API
Creating Your Own API Using FastAPI
Setting Up the API
Adding Functionality to Your API
Making Requests to Your API
Key Takeaways
12. Automation and Deployment
Deploying Code
Automation Examples
Pre-Commit Hooks
GitHub Actions
Cloud Deployments
Containers and Docker
Building a Docker Container
Deploying an API on Google Cloud
Deploying an API on Other Cloud Providers
Key Takeaways
13. Security
What Is Security?
Security Risks
Credentials, Physical Security, and Social Engineering
Third-Party Packages
The Python Pickle Module
Version Control Risks
API Security Risks
Security Practices
Security Reviews and Policies
Secure Coding Tools
Simple Code Scanning
Security for Machine Learning
Attacks on ML Systems
Security Practices for ML Systems
Key Takeaways
14. Working in Software
Development Principles and Practices
The Software Development Lifecycle
Waterfall Software Development
Agile Software Development
Agile Data Science
Roles in the Software Industry
Software Engineer
QA or Test Engineer
Data Engineer
Data Analyst
Product Manager
UX Researcher
Designer
Community
Open Source
Speaking at Events
The Python Community
Key Takeaways
15. Next Steps
The Future of Code
Your Future in Code
Thank You
Index
About the Author