Data Engineering with dbt: A practical guide to building a cloud-based, pragmatic, and dependable data platform with SQL
ISBN 9781803246284
Use easy-to-apply patterns in SQL and Python to adopt modern analytics engineering to build agile platforms with dbt tha
English
Pages 578
Year 2023
Table of contents:
Data Engineering with dbt
Contributors
About the author
About the reviewers
Preface
Who this book is for
What this book covers
To get the most out of this book
Download the example code files
Conventions used
Get in touch
Share Your Thoughts
Download a free PDF copy of this book
Part 1: The Foundations of Data Engineering
1
The Basics of SQL to Transform Data
Technical requirements
Introducing SQL
SQL basics – core concepts and commands
SQL core concepts
Understanding the categories of SQL commands
Setting up a Snowflake database with users and roles
Creating your Snowflake account
Setting up initial users, roles, and a database in Snowflake
Creating and granting your first role
Querying data in SQL – syntax and operators
Snowflake query syntax
SQL operators
Combining data in SQL – the JOIN clause
Combining orders and customers
JOIN types
Visual representation of join types
Advanced – introducing window functions
Window definition
Window frame definition
Summary
Further reading
2
Setting Up Your dbt Cloud Development Environment
Technical requirements
Setting up your GitHub account
Introducing Version Control
Creating your GitHub account
Setting up your first repository for dbt
Setting up your dbt Cloud account
Signing up for a dbt Cloud account
Setting up your first dbt Cloud project
Adding the default project to an empty repository
Comparing dbt Core and dbt Cloud workflows
dbt Core workflows
dbt Cloud workflows
Experimenting with SQL in dbt Cloud
Exploring the dbt Cloud IDE
Executing SQL from the dbt IDE
Introducing the source and ref dbt functions
Exploring the dbt default model
Using ref and source to connect models
Running your first models
Testing your first models
Editing your first model
Summary
Further reading
3
Data Modeling for Data Engineering
Technical requirements
What is and why do we need data modeling?
Understanding data
What is data modeling?
Why we need data modeling
Complementing a visual data model
Conceptual, logical, and physical data models
Conceptual data model
Logical data model
Physical data model
Tools to draw data models
Entity-Relationship modeling
Main notation
Cardinality
Time perspective
An example of an E-R model at different levels of detail
Generalization and specialization
Modeling use cases and patterns
Header-detail use case
Hierarchical relationships
Forecasts and actuals
Libraries of standard data models
Common problems in data models
Fan trap
Chasm trap
Modeling styles and architectures
Kimball method or dimensional modeling or star schema
Unified Star Schema
Inmon design style
Data Vault
Data mesh
Our approach, the Pragmatic Data Platform - PDP
Summary
Further reading
4
Analytics Engineering as the New Core of Data Engineering
Technical requirements
The data life cycle and its evolution
Understanding the data flow
Data creation
Data movement and storage
Data transformation
Business reporting
Feeding back to the source systems
Understanding the modern data stack
The traditional data stack
The modern data stack
Defining analytics engineering
The roles in the modern data stack
The analytics engineer
DataOps – software engineering best practices for data
Version control
Quality assurance
The modularity of the code base
Development environments
Designing for maintainability
Summary
Further reading
5
Transforming Data with dbt
Technical requirements
The dbt Core workflow for ingesting and transforming data
Introducing our stock tracking project
The initial data model and glossary
Setting up the project in dbt, Snowflake, and GitHub
Defining data sources and providing reference data
Defining data sources in dbt
Loading the first data for the portfolio project
How to write and test transformations
Writing the first dbt model
Real-time lineage and project navigation
Deploying the first dbt model
Committing the first dbt model
Configuring our project and where we store data
Re-deploying our environment to the desired schema
Configuring the layers for our architecture
Ensuring data quality with tests
Generating the documentation
Summary
Part 2: Agile Data Engineering with dbt
6
Writing Maintainable Code
Technical requirements
Writing code for humans
Refactoring our initial model to be human-readable
Creating the architectural layers
Creating the Staging layer
Goals and contents of the staging models
Connecting the REF model to the STG
Goals and contents of the refined layer
Creating the first data mart
Saving history is crucial
Saving history with dbt
Saving history using snapshots
Connecting the REF layer with the snapshot
Summary
7
Working with Dimensional Data
Adding dimensional data
Creating clear data models for the refined and data mart layers
Loading the data of the first dimension
Creating and loading a CSV as a seed
Configuring the seeds and loading them
Adding data types and a load timestamp to your seed
Building the STG model for the first dimension
Defining the external data source for seeds
Creating an STG model for the security dimension
Adding the default record to the STG
Saving history for the dimensional data
Saving the history with a snapshot
Building the REF layer with the dimensional data
Adding the dimensional data to the data mart
Exercise – adding a few more hand-maintained dimensions
Summary
8
Delivering Consistency in Your Data
Technical requirements
Keeping consistency by reusing code – macros
Repetition is inherent in data projects
Why copy and paste kills your future self
How to write a macro
Refactoring the “current” CTE into a macro
Fixing data loaded from our CSV file
The basics of macro writing
Building on the shoulders of giants – dbt packages
Creating dbt packages
How to import a package in dbt
Browsing through noteworthy packages for dbt
Adding the dbt-utils package to our project
Summary
9
Delivering Reliability in Your Data
Testing to provide reliability
Types of tests
Singular tests
Generic tests
Defining a generic test
Testing the right things in the right places
What do we test?
Where to test what?
Testing our models to ensure good quality
Summary
10
Agile Development
Technical requirements
Agile development and collaboration
Defining agile development
Applying agile to data engineering
Starting a project in an agile way
Organizing work the agile way
Managing the backlog in an agile way
Building reports in an agile way
S1 – designing a light data model for the data mart
S2 – designing a light data model for the REF layer
S3.x – developing with dbt models the pipeline for the XYZ table
S4 – an acceptance test of the data produced in the data mart
S5 – development and verification of the report in the BI application
Summary
11
Team Collaboration
Enabling collaboration
Core collaboration practices
Collaboration with dbt Cloud
Working with branches and PRs
Working with Git in dbt Cloud
The dbt Cloud Git process
Keeping your development environment healthy
Suggested Git branch naming
Adopting frequent releases
Making your first PR
Summary
Further reading
Part 3: Hands-On Best Practices for Simple, Future-Proof Data Platforms
12
Deployment, Execution, and Documentation Automation
Technical requirements
Designing your deployment automation
Working with dbt environments
Creating our QA and PROD environments
Deciding where to deploy
Creating jobs
Designing the architecture of your data platform
Notifications
Advanced automation – hooks and run-operations
Hooks
Run-operations
Table migrations
Documentation
Lineage graph
dbt-generated documentation
Source freshness report
Exposures
Markdown documentation
Summary
13
Moving Beyond the Basics
Technical requirements
Building for modularity
Modularity in the storage layer
Modularity in the refined layer
Modularity in the delivery layer
Managing identity
Identity and semantics – defining your concepts
Different types of keys
Main uses of keys
Master Data management
Data for Master Data management
A light MDM approach with dbt
Saving history at scale
Understanding the save_history macro
Understanding the current_from_history macro
Summary
14
Enhancing Software Quality
Technical requirements
Refactoring and evolving models
Dealing with technical debt
Implementing real-world code and business rules
Replacing snapshots with HIST tables
Renaming the REF_ABC_BANK_SECURITY_INFO model
Handling orphans in facts
Calculating closed positions
Calculating transactions
Publishing dependable datasets
Managing data marts like APIs
What shape should you use for your data mart?
Self-completing dimensions
History in reports – that is, slowly changing dimensions type two
Summary
Further reading
15
Patterns for Frequent Use Cases
Technical requirements
Ingestion patterns
Basic setup for ingestion
Loading data from files
External tables
Landing tables
History patterns
Storing history with deletions – full load
Storing history with deletion – deletion list
Storing history with multiple versions in the input
Storing history with PII and GDPR compliance
History and schema evolution
Summary
Further reading
Index
Why subscribe?
Other Books You May Enjoy
Packt is searching for authors like you
Share Your Thoughts
Download a free PDF copy of this book