Table of Contents

Preface
    How to Contact Us
    Conventions Used in This Book
    Using Code Examples
    O'Reilly Online Learning
    Acknowledgments

1. The Evolution of Data Architectures
    A Brief History of Relational Databases
    Data Warehouses
        Data Warehouse Architecture
        Dimensional Modeling
        Data Warehouse Benefits and Challenges
    Introducing Data Lakes
    Data Lakehouse
        Data Lakehouse Benefits
        Implementing a Lakehouse
    Delta Lake
        The Medallion Architecture
    The Delta Ecosystem
        Delta Lake Storage
        Delta Sharing
        Delta Connectors
    Conclusion

2. Getting Started with Delta Lake
    Getting a Standard Spark Image
    Using Delta Lake with PySpark
    Running Delta Lake in the Spark Scala Shell
    Running Delta Lake on Databricks
    Creating and Running a Spark Program: helloDeltaLake
    The Delta Lake Format
        Parquet Files
            Advantages of Parquet files
            Writing a Parquet file
        Writing a Delta Table
    The Delta Lake Transaction Log
        How the Transaction Log Implements Atomicity
        Breaking Down Transactions into Atomic Commits
        The Transaction Log at the File Level
            Write multiple writes to the same file
            Reading the latest version of a Delta table
            Failure scenario with a write operation
            Update scenario
        Scaling Massive Metadata
            Checkpoint file example
            Displaying the checkpoint file
    Conclusion

3. Basic Operations on Delta Tables
    Creating a Delta Table
        Creating a Delta Table with SQL DDL
        The DESCRIBE Statement
        Creating Delta Tables with the DataFrameWriter API
            Creating a managed table
            Creating an unmanaged table
        Creating a Delta Table with the DeltaTableBuilder API
        Generated Columns
    Reading a Delta Table
        Reading a Delta Table with SQL
        Reading a Table with PySpark
    Writing to a Delta Table
        Cleaning Out the YellowTaxis Table
        Inserting Data with SQL INSERT
        Appending a DataFrame to a Table
        Using the OverWrite Mode When Writing to a Delta Table
        Inserting Data with the SQL COPY INTO Command
    Partitions
        Partitioning by a single column
        Partitioning by multiple columns
        Checking if a partition exists
        Selectively updating Delta partitions with replaceWhere
    User-Defined Metadata
        Using SparkSession to Set Custom Metadata
        Using the DataFrameWriter to Set Custom Metadata
    Conclusion

4. Table Deletes, Updates, and Merges
    Deleting Data from a Delta Table
        Table Creation and DESCRIBE HISTORY
        Performing the DELETE Operation
        DELETE Performance Tuning Tips
    Updating Data in a Table
        Use Case Description
        Updating Data in a Table
        UPDATE Performance Tuning Tips
    Upsert Data Using the MERGE Operation
        Use Case Description
        The MERGE Dataset
        The MERGE Statement
            Modifying unmatched rows using MERGE
            Analyzing the MERGE operation with DESCRIBE HISTORY
        Inner Workings of the MERGE Operation
    Conclusion

5. Performance Tuning
    Data Skipping
    Partitioning
        Partitioning Warnings and Considerations
    Compact Files
        Compaction
        OPTIMIZE
            OPTIMIZE considerations
    ZORDER BY
        ZORDER BY Considerations
    Liquid Clustering
        Enabling Liquid Clustering
        Operations on Clustered Columns
            Changing clustered columns
            Viewing clustered columns
            Removing clustered columns
        Liquid Clustering Warnings and Considerations
    Conclusion

6. Using Time Travel
    Delta Lake Time Travel
        Restoring a Table
        Restoring via Timestamp
        Time Travel Under the Hood
        RESTORE Considerations and Warnings
    Querying an Older Version of a Table
    Data Retention
        Data File Retention
        Log File Retention
        Setting File Retention Duration Example
    Data Archiving
    VACUUM
        VACUUM Syntax and Examples
        How Often Should You Run VACUUM and Other Maintenance Tasks?
        VACUUM Warnings and Considerations
    Change Data Feed
        Enabling the CDF
        Viewing the CDF
        CDF Warnings and Considerations
    Conclusion

7. Schema Handling
    Schema Validation
        Viewing the Schema in the Transaction Log Entries
        Schema on Write
        Schema Enforcement Example
            Matching schema
            Schema with an additional column
    Schema Evolution
        Adding a Column
        Missing Data Column in Source DataFrame
        Changing a Column Data Type
        Adding a NullType Column
    Explicit Schema Updates
        Adding a Column to a Table
        Adding Comments to a Column
        Changing Column Ordering
        Delta Lake Column Mapping
        Renaming a Column
        Replacing the Table Columns
        Dropping a Column
        The REORG TABLE Command
        Changing Column Data Type or Name
    Conclusion

8. Operations on Streaming Data
    Streaming Overview
        Spark Structured Streaming
        Delta Lake and Structured Streaming
    Streaming Examples
        Hello Streaming World
            Creating the streaming query
            The query process log
            The checkpoint file
        AvailableNow Streaming
        Updating the Source Records
            The StreamingQuery class
            Reprocessing all or part of the source records
        Reading a Stream from the Change Data Feed
    Conclusion

9. Delta Sharing
    Conventional Methods of Data Sharing
        Legacy and Homegrown Solutions
        Proprietary Vendor Solutions
        Cloud Object Storage
    Open Source Delta Sharing
        Delta Sharing Goals
    Delta Sharing Under the Hood
        Data Providers and Recipients
        Benefits of the Design
    The delta-sharing Repository
        Step 1: Installing the Python Connector
        Step 2: Installing the Profile File
        Step 3: Reading a Shared Table
    Conclusion

10. Building a Lakehouse on Delta Lake
    Storage Layer
        What Is a Data Lake?
        Types of Data
        Key Benefits of a Cloud Data Lake
    Data Management
    SQL Analytics
        SQL Analytics via Spark SQL
        SQL Analytics via Other Delta Lake Integrations
    Data for Data Science and Machine Learning
        Challenges with Traditional Machine Learning
        Delta Lake Features That Support Machine Learning
    Putting It All Together
        Medallion Architecture
        The Bronze Layer (Raw Data)
        The Silver Layer
        The Gold Layer
        The Complete Lakehouse
    Conclusion

Index