Table of Contents

Preface
    How to Contact Us
    Conventions Used in This Book
    Using Code Examples
    O'Reilly Online Learning
    Acknowledgments

1. The Evolution of Data Architectures
    A Brief History of Relational Databases
    Data Warehouses
        Data Warehouse Architecture
        Dimensional Modeling
        Data Warehouse Benefits and Challenges
    Introducing Data Lakes
    Data Lakehouse
        Data Lakehouse Benefits
        Implementing a Lakehouse
    Delta Lake
        The Medallion Architecture
    The Delta Ecosystem
        Delta Lake Storage
        Delta Sharing
        Delta Connectors
    Conclusion

2. Getting Started with Delta Lake
    Getting a Standard Spark Image
    Using Delta Lake with PySpark
    Running Delta Lake in the Spark Scala Shell
    Running Delta Lake on Databricks
    Creating and Running a Spark Program: helloDeltaLake
    The Delta Lake Format
        Parquet Files
            Advantages of Parquet files
            Writing a Parquet file
        Writing a Delta Table
    The Delta Lake Transaction Log
        How the Transaction Log Implements Atomicity
        Breaking Down Transactions into Atomic Commits
        The Transaction Log at the File Level
            Write multiple writes to the same file
            Reading the latest version of a Delta table
            Failure scenario with a write operation
            Update scenario
        Scaling Massive Metadata
            Checkpoint file example
            Displaying the checkpoint file
    Conclusion

3. Basic Operations on Delta Tables
    Creating a Delta Table
        Creating a Delta Table with SQL DDL
        The DESCRIBE Statement
        Creating Delta Tables with the DataFrameWriter API
            Creating a managed table
            Creating an unmanaged table
        Creating a Delta Table with the DeltaTableBuilder API
        Generated Columns
    Reading a Delta Table
        Reading a Delta Table with SQL
        Reading a Table with PySpark
    Writing to a Delta Table
        Cleaning Out the YellowTaxis Table
        Inserting Data with SQL INSERT
        Appending a DataFrame to a Table
        Using the OverWrite Mode When Writing to a Delta Table
        Inserting Data with the SQL COPY INTO Command
    Partitions
        Partitioning by a single column
        Partitioning by multiple columns
        Checking if a partition exists
        Selectively updating Delta partitions with replaceWhere
    User-Defined Metadata
        Using SparkSession to Set Custom Metadata
        Using the DataFrameWriter to Set Custom Metadata
    Conclusion

4. Table Deletes, Updates, and Merges
    Deleting Data from a Delta Table
        Table Creation and DESCRIBE HISTORY
        Performing the DELETE Operation
        DELETE Performance Tuning Tips
    Updating Data in a Table
        Use Case Description
        Updating Data in a Table
        UPDATE Performance Tuning Tips
    Upsert Data Using the MERGE Operation
        Use Case Description
        The MERGE Dataset
        The MERGE Statement
            Modifying unmatched rows using MERGE
            Analyzing the MERGE operation with DESCRIBE HISTORY
        Inner Workings of the MERGE Operation
    Conclusion

5. Performance Tuning
    Data Skipping
    Partitioning
        Partitioning Warnings and Considerations
    Compact Files
        Compaction
        OPTIMIZE
            OPTIMIZE considerations
    ZORDER BY
        ZORDER BY Considerations
    Liquid Clustering
        Enabling Liquid Clustering
        Operations on Clustered Columns
            Changing clustered columns
            Viewing clustered columns
            Removing clustered columns
        Liquid Clustering Warnings and Considerations
    Conclusion

6. Using Time Travel
    Delta Lake Time Travel
        Restoring a Table
        Restoring via Timestamp
        Time Travel Under the Hood
        RESTORE Considerations and Warnings
    Querying an Older Version of a Table
    Data Retention
        Data File Retention
        Log File Retention
        Setting File Retention Duration Example
    Data Archiving
    VACUUM
        VACUUM Syntax and Examples
        How Often Should You Run VACUUM and Other Maintenance Tasks?
        VACUUM Warnings and Considerations
    Change Data Feed
        Enabling the CDF
        Viewing the CDF
        CDF Warnings and Considerations
    Conclusion

7. Schema Handling
    Schema Validation
        Viewing the Schema in the Transaction Log Entries
        Schema on Write
        Schema Enforcement Example
            Matching schema
            Schema with an additional column
    Schema Evolution
        Adding a Column
        Missing Data Column in Source DataFrame
        Changing a Column Data Type
        Adding a NullType Column
    Explicit Schema Updates
        Adding a Column to a Table
        Adding Comments to a Column
        Changing Column Ordering
        Delta Lake Column Mapping
        Renaming a Column
        Replacing the Table Columns
        Dropping a Column
        The REORG TABLE Command
        Changing Column Data Type or Name
    Conclusion

8. Operations on Streaming Data
    Streaming Overview
        Spark Structured Streaming
        Delta Lake and Structured Streaming
    Streaming Examples
        Hello Streaming World
            Creating the streaming query
            The query process log
            The checkpoint file
        AvailableNow Streaming
        Updating the Source Records
            The StreamingQuery class
            Reprocessing all or part of the source records
        Reading a Stream from the Change Data Feed
    Conclusion

9. Delta Sharing
    Conventional Methods of Data Sharing
        Legacy and Homegrown Solutions
        Proprietary Vendor Solutions
        Cloud Object Storage
    Open Source Delta Sharing
        Delta Sharing Goals
    Delta Sharing Under the Hood
        Data Providers and Recipients
        Benefits of the Design
    The delta-sharing Repository
        Step 1: Installing the Python Connector
        Step 2: Installing the Profile File
        Step 3: Reading a Shared Table
    Conclusion

10. Building a Lakehouse on Delta Lake
    Storage Layer
        What Is a Data Lake?
        Types of Data
        Key Benefits of a Cloud Data Lake
    Data Management
    SQL Analytics
        SQL Analytics via Spark SQL
        SQL Analytics via Other Delta Lake Integrations
    Data for Data Science and Machine Learning
        Challenges with Traditional Machine Learning
        Delta Lake Features That Support Machine Learning
    Putting It All Together
        Medallion Architecture
        The Bronze Layer (Raw Data)
        The Silver Layer
        The Gold Layer
        The Complete Lakehouse
    Conclusion

Index