Databricks Lakehouse Platform Cookbook: 100+ recipes for building a scalable and secure Databricks Lakehouse (ISBN 9789355519566)

Analyze, Architect, and Innovate with Databricks Lakehouse

KEY FEATURES
● Create a Lakehouse using Databricks, including …


Language: English · Pages: 581 · Year: 2023


Table of contents:
Cover
Title Page
Copyright Page
Dedication Page
Foreword
About the Author
About the Reviewers
Acknowledgement
Preface
Table of Contents
1. Introduction to Databricks Lakehouse
Introduction
Structure
Objectives
Background
Brief history of Big Data, Spark, and Databricks
Databricks Community Edition
Recipe 1: Signing up for the Databricks Community Edition
Recipe 2: Creating a notebook in the Databricks Community Edition
Recipe 3: Changing a notebook’s default language
Recipe 4: Create a table from CSV using SQL
Recipe 5: Query a table using SQL
Recipe 6: Examine a table’s structure
Recipe 7: Use infer schema on CSV in SQL
Recipe 8: Compute mean in group by in SQL
Recipe 9: Importing a notebook
Recipe 10: Exporting a notebook in Databricks Community Edition
Data Lakehouse value proposition
Lakehouse architecture
Separation of computing and storage
Data lake
Delta Lake
Computational engine
Design considerations
Extraction and storage by system
Zones and their definitions
Source
Bronze
Silver
Gold
Lakehouse compared to other data technologies
Extract-load-transform (ELT) and extract-transform-load (ETL)
Compared to traditional data lake approaches
Differences from Lambda architecture
Conclusion
Points to remember
2. Setting Up a Databricks Workspace
Introduction
Structure
Objectives
Core Databricks concepts
Databricks service tiers
Brief introduction of Databricks features
Machine Learning
Notebook access control
Databricks SQL and endpoints
Internet Protocol (IP) address access control
Databricks pricing model
Pick your cloud
AWS
Azure
Google Cloud Platform
Deployment details
Public availability
Network size
Network peering
Initial configuration
Access control
Cluster types
All-purpose
Job clusters
Cluster creation details
Single or multiple nodes
Access mode
Choosing performance level
Conclusion
3. Connecting to Storage
Introduction
Structure
Objectives
Databricks file system
Using mount points
Recipe 11: Using the DBFS file browser
Recipe 12: Using Databricks’ web terminal
Recipe 13: Using Databricks Utilities’ file system methods
The importance of DBFS
Lakehouse design
Source to Silver
Including raw
Allowed operations crossing layers
Source to raw
Source to bronze
Raw to bronze
Bronze to silver
Silver to silver
Silver to gold and gold to gold
Recipe 14: Using the Lakehouse layer presentation
Azure
ADLS Gen2
Credential passthrough
Recipe 15: Creating a storage account for ADLS Gen2
Recipe 16: Creating a container and setting ACLs
Recipe 17: Using Passthrough authentication
Key vault and secret scope
Recipe 18: Link a key vault to a secret scope
Recipe 19: Displaying a redacted value
Blob storage
Recipe 20: Account keys
Recipe 21: Service principal
Recipe 22: Shared access signatures
Conclusion
4. Creating Delta Tables
Introduction
Structure
Objectives
Delta Lake
Managed and unmanaged tables
Deciding table type
Schema and database
Creating managed Delta tables
Ways to create
Recipe 23: Upload data using Databricks workspace
SQL
Recipe 24: Reading the SQL language reference
Recipe 25: Creating a table with SQL
Recipe 26: Creating a table with SQL using AS
Spark API
Recipe 27: Creating a table using Spark API and random data
Recipe 28: Examining table history
Managed tables details
Recipe 29: Managed Delta table details
Recipe 30: Using Data Explorer to see table details
Creating unmanaged tables
Recipe 31: Using Databricks CLI to create a secret scope
Recipe 32: Accessing S3 from Databricks on AWS
Recipe 33: Creating an external Delta table in SQL on AWS
Recipe 34: Creating an external table in PySpark on AWS
Recipe 35: Creating an external Delta table in SQL on Azure
Recipe 36: Creating an external table with Python on Azure
Recipe 37: Accessing GCP buckets from Databricks
Recipe 38: Creating an external Delta table in SQL on GCP
Recipe 39: Creating an external Delta table in Python on GCP
Conclusion
5. Data Profiling and Modeling in the Lakehouse
Introduction
Structure
Objectives
Data profiling
Recipe 40: Using Azure Data Factory to ingest raw
Recipe 41: Reorganize files
Recipe 42: Creating tables from a directory programmatically
Recipe 43: Data profiling using Databricks native functionality
Recipe 44: Listing row counts for all tables
Recipe 45: Using DBUtils summarize
Recipe 46: Using a DataFrame's describe and summary methods
Recipe 47: Descriptive data analysis with Pandas profiling
Data modeling
Common modeling approaches
Entity-relationship data modeling
Star schema
Snowflake schema
Standardized data models
Retrieval optimized models
Design approach
Conclusion
6. Extracting from Source and Loading to Bronze
Introduction
Structure
Objectives
To raw or not to raw
Using change data feed
Overview of change data feed
Recipe 48: Creating a table with change data feed on
Recipe 49: Using Python to enable CDF
Recipe 50: Ensure CDF is enabled for all tables
Loading files using self-managed watermarks
Incremental ingestion example
Recipe 51: Using incremental load of files
Recipe 52: Convert Event Hub data to JSON
Recipe 53: Full load of files
Loading files using Auto Loader
Auto Loader overview
Recipe 54: Incremental ingestion of Avro files using Auto Loader in Python
Recipe 55: Incremental ingestion of CSV files using Auto Loader in Python
Loading files using Delta Live Tables
Delta Live Tables overview
Recipe 56: Using the DLT SQL API to ingest JSON
Recipe 57: Incremental ingestion using the DLT Python API
Recipe 58: Full ingestion using the DLT SQL API
Recipe 59: Full ingestion using the DLT Python API
Loading streaming data
Recipe 60: Parameterizing pipelines
Recipe 61: Stream processing with DLT Python API
Recipe 62: Using Spark structured streaming
Conclusion
7. Transforming to Create Silver
Introduction
Structure
Objectives
Bronze to silver
Incremental refinement
Recipe 63: Incremental refinement using Delta Live Tables
Recipe 64: Incremental refinement using PySpark
Full refinement
Recipe 65: Full update refinement using Delta Live Tables
Recipe 66: Full refinement using PySpark
Data quality rules
Recipe 67: Using expectations in DLT with SQL
Recipe 68: Using expectations in DLT with PySpark
Silver to silver
Reshaping projection
Recipe 69: Projection reshaping using Python
Recipe 70: Projection reshaping using Delta Live Tables
Splitting tables
Recipe 71: Splitting table into multiple in PySpark
Recipe 72: Splitting table into multiple in Delta Live Tables
Enrichment
Recipe 73: Creating lookup data from telemetry
Recipe 74: Combining tables using DLT
Conclusion
8. Transforming to Create Gold for Business Purposes
Introduction
Structure
Objectives
Silver to gold
Aggregation
Recipe 75: Aggregation in Delta Live Tables
Dimensional tables using PySpark
Recipe 76: Creating a time dimension
Recipe 77: Creating a dimension from telemetry
Recipe 78: Creating a fact table from telemetry
Dimensional tables in Delta Live Tables
Recipe 79: Dimensional models with Delta Live Tables
Using Common Data Models with Delta Live Tables
Microsoft Common Data Model
Gold to gold
Table optimization for consumption
Optimize
Recipe 80: Manually optimize a table
Vacuum
Recipe 81: Vacuum a Delta table
Conclusion
9. Machine Learning and Data Science
Introduction
Structure
Objectives
Machine Learning in Databricks
Using AutoML
Recipe 82: Creating an ML cluster
Recipe 83: Importing data with the Databricks web page
Recipe 84: Creating and running an AutoML experiment
Setting up and using MLflow
Recipe 85: Setting up an MLflow experiment
Recipe 86: Using MLflow for non-ML workflows
Deploying models to production
Recipe 87: Registering a model
Recipe 88: Using a model for inference
Using Databricks feature store
Recipe 89: Importing an HTML notebook
Recipe 90: Basic interaction with Databricks Feature Store
Conclusion
10. SQL Analysis
Introduction
Structure
Objectives
Databricks SQL
Creating and managing a SQL Warehouse
Recipe 91: Creating a SQL Warehouse
Recipe 92: Connect to a SQL Warehouse from a Python Jupyter Notebook
Using the SQL Editor
Writing queries
Common interview queries
Recipe 93: Show the contents of a table
Recipe 94: Select with filtered ordered limited result
Recipe 95: Aggregation of records
Recipe 96: Using grouping to find duplicate records
Recipe 97: Generating synthetic data
Recipe 98: Calculate rollups
Recipe 99: Types of joins
Inner
Left and right outer joins
Full outer join
Cross join
Creating dashboards
Recipe 100: Creating a quick dashboard
Recipe 101: Schedule dashboard refresh
Setting alerts
Recipe 102: Create a query for an alert
Recipe 103: Create an alert
Cost and performance considerations
Conclusion
11. Graph Analysis
Introduction
Structure
Objectives
What is a graph
When to use graph operations
GraphX
GraphFrames
Recipe 104: Creating a GraphFrame
Recipe 105: Using example graphs
Graph operations and algorithms
Recipe 106: Breadth-first search
Recipe 107: PageRank
Recipe 108: Shortest path
Recipe 109: Connected components
Recipe 110: Strongly connected components
Recipe 111: Label Propagation Algorithm
Recipe 112: Motif finding
Neo4J and Databricks
Recipe 113: Using AuraDB
Recipe 114: Reading Neo4J’s AuraDB from Databricks
Conclusion
12. Visualizations
Introduction
Structure
Objectives
Visualization best practices
Visually appealing
Keep it simple
Explain unfamiliar graph types
Follow conventions
Tell a story
Databricks dashboards
Recipe 115: Importing sample dashboards
Recipe 116: Data preparation for a new dashboard
Recipe 117: Creating a dashboard
Visualizations in Databricks notebooks
Recipe 118: Using visualizations in notebooks
Power BI
Recipe 119: Connecting Power BI to Databricks
Conclusion
13. Governance
Introduction
Structure
Objectives
Role of data governance
Using Unity Catalog
Recipe 120: Configuring Unity Catalog in Azure
Creating storage
Create a managed identity
Create Access Connector for Azure Databricks
Grant managed identity access
Creating a metastore
Unity Catalog object model
Recipe 121: Creating a new catalog
Recipe 122: Uploading data
Recipe 123: Creating a table
Installing and using Purview
Recipe 124: Installing Purview
Recipe 125: Connecting Purview to Databricks
Recipe 126: Scanning a Databricks workspace
Recipe 127: Browsing the Data Catalog
Conclusion
14. Operations
Introduction
Structure
Objectives
Source code management and orchestration
Recipe 128: Use GitHub with Databricks
Recipe 129: Create workflows to orchestrate processing
Recipe 130: Saving a Job JSON
Recipe 131: Use Airflow to coordinate processing
Scheduled and ongoing maintenance
Recipe 132: Repairing damaged tables
Recipe 133: Vacuum unneeded data
Recipe 134: Optimize Delta tables
Cost management
Recipe 135: Use cluster policies
Recipe 136: Using tags to monitor costs
Conclusion
15. Tips, Tricks, Troubleshooting, and Best Practices
Introduction
Structure
Objectives
Ingesting relational data with Databricks
Recipe 137: Loading data from MySQL
Recipe 138: Extending a Python class and reading using Databricks runtime format
Recipe 139: Caching DataFrames
Recipe 140: Loading data from MySQL using workers
Performance optimization
Using the Databricks event log
Exploring the Spark UI jobs tab
Using the Spark UI SQL/DataFrame tab
Recipe 141: Using pools to improve performance
Programmatic deployment and interaction
Recipe 142: Creating a workspace with an ARM template
Recipe 143: Using the Databricks API
Reading a Kafka stream
Recipe 144: Creating a Kafka cluster
Recipe 145: Using Confluent Cloud
Notebook orchestration
Recipe 146: Running a notebook with parameters
Recipe 147: Conditional execution of notebooks
Best practices
Organize data assets by source until silver
Use automation as much as possible
Use version control
Keep each step of the process simple
Do not be afraid to change
Conclusion
Index
