Table of contents:

Cover
Title Page
Copyright Page
Dedication Page
Foreword
About the Author
About the Reviewers
Acknowledgement
Preface
Table of Contents

1. Introduction to Databricks Lakehouse
    Introduction
    Structure
    Objectives
    Background
    Brief history of Big Data, Spark, and Databricks
    Databricks community edition
    Recipe 1: Signing up for the Databricks community edition
    Recipe 2: Creating a notebook in the Databricks Community edition
    Recipe 3: Changing a notebook’s default language
    Recipe 4: Create a table from CSV using SQL
    Recipe 5: Query a table using SQL
    Recipe 6: Examine a table’s structure
    Recipe 7: Use infer schema on CSV in SQL
    Recipe 8: Compute mean in group by in SQL
    Recipe 9: Importing a notebook
    Recipe 10: Exporting a notebook in Databricks Community Edition
    Data Lakehouse value proposition
    Lakehouse architecture
    Separation of computing and storage
    Data lake
    Delta Lake
    Computational engine
    Design considerations
    Extraction and storage by system
    Zones and their definitions
    Source
    Bronze
    Silver
    Gold
    Lakehouse compared to other data technologies
    Extract load transform and extract transform load
    Compared to traditional data lake approaches
    Differences from Lambda architecture
    Conclusion
    Points to remember

2. Setting up a Databricks Workspace
    Introduction
    Structure
    Objectives
    Core Databricks concepts
    Databricks service tiers
    Brief introduction of Databricks features
    Machine Learning
    Notebook access control
    Databricks SQL and endpoints
    Internet protocol addresses access control
    Databricks pricing model
    Pick your cloud
    AWS
    Azure
    Google Cloud Platform
    Deployment details
    Public availability
    Network size
    Network peering
    Initial configuration
    Access control
    Cluster types
    All purpose
    Job clusters
    Cluster creation details
    Single or multiple nodes
    Access mode
    Choosing performance level
    Conclusion

3. Connecting to Storage
    Introduction
    Structure
    Objectives
    Databricks file system
    Using mount points
    Recipe 11: Using the DBFS file browser
    Recipe 12: Using Databricks’ web terminal
    Recipe 13: Using Databricks Utilities’ file system methods
    The importance of DBFS
    Lakehouse design
    Source to Silver
    Including raw
    Allowed operations crossing layers
    Source to raw
    Source to bronze
    Raw to bronze
    Bronze to silver
    Silver to silver
    Silver to gold and gold to gold
    Recipe 14: Using the Lakehouse layer presentation
    Azure ADLS Gen2
    Credential passthrough
    Recipe 15: Creating a storage account for ADLS Gen2
    Recipe 16: Creating a container and setting ACLs
    Recipe 17: Using Passthrough authentication
    Key vault and secret scope
    Recipe 18: Link a key vault to a secret scope
    Recipe 19: Displaying a redacted value
    Blob storage
    Recipe 20: Account keys
    Recipe 21: Service principal
    Recipe 22: Shared access signatures
    Conclusion

4. Creating Delta Tables
    Introduction
    Structure
    Objectives
    Delta Lake
    Managed and unmanaged tables
    Deciding table type
    Schema and database
    Creating managed Delta tables
    Ways to create
    Recipe 23: Upload data using Databricks workspace
    SQL
    Recipe 24: Reading the SQL language reference
    Recipe 25: Creating a table with SQL
    Recipe 26: Creating a table with SQL using AS
    Spark API
    Recipe 27: Creating a table using Spark API and random data
    Recipe 28: Examining table history
    Managed tables details
    Recipe 29: Managed Delta table details
    Recipe 30: Using Data Explorer to see table details
    Creating unmanaged tables
    Recipe 31: Using Databricks CLI to create a secret scope
    Recipe 32: Accessing S3 from Databricks on AWS
    Recipe 33: Creating an external Delta table in SQL on AWS
    Recipe 34: Creating an external table in PySpark on AWS
    Recipe 35: Creating an external Delta table in SQL on Azure
    Recipe 36: Creating an external table with Python on Azure
    Recipe 37: Accessing GCP buckets from Databricks
    Recipe 38: Creating an external Delta table in SQL on GCP
    Recipe 39: Creating an external Delta table in Python on GCP
    Conclusion

5. Data Profiling and Modeling in the Lakehouse
    Introduction
    Structure
    Objectives
    Data profiling
    Recipe 40: Using Azure Data Factory to ingest raw
    Recipe 41: Reorganize files
    Recipe 42: Creating tables from a directory programmatically
    Recipe 43: Data profiling using Databricks native functionality
    Recipe 44: Listing row counts for all
    Recipe 45: Using DBUtils summarize
    Recipe 46: Using a DataFrame’s describe and summary methods
    Recipe 47: Descriptive data analysis with Pandas profiling
    Data modeling
    Common modeling approaches
    Entity-relationship data modelling
    Star schema
    Snowflake schema
    Standardized data models
    Retrieval optimized models
    Design approach
    Conclusion

6. Extracting from Source and Loading to Bronze
    Introduction
    Structure
    Objectives
    To raw or not to raw
    Using change data feed
    Overview of change data feed
    Recipe 48: Creating a table with change data feed on
    Recipe 49: Using Python to enable CDF
    Recipe 50: Ensure CDF is enabled for all tables
    Loading files using self-managed watermarks
    Incremental ingestion example
    Recipe 51: Using incremental load of files
    Recipe 52: Convert Event Hub data to JSON
    Recipe 53: Full load of files
    Loading files using Auto Loader
    Auto Loader overview
    Recipe 54: Incremental ingestion of Avro files using Auto Loader in Python
    Recipe 55: Incremental ingestion of CSV files using Auto Loader in Python
    Loading files using Delta Live Tables
    Delta Live Tables overview
    Recipe 56: Using the DLT SQL API to ingest JSON
    Recipe 57: Incremental ingestion using DLT using Python API
    Recipe 58: Full ingestion using DLT using SQL API
    Recipe 59: Full ingestion using DLT using Python API
    Loading streaming data
    Recipe 60: Parameterizing pipelines
    Recipe 61: Stream processing with DLT Python API
    Recipe 62: Using Spark structured streaming
    Conclusion

7. Transforming to Create Silver
    Introduction
    Structure
    Objectives
    Bronze to silver
    Incremental refinement
    Recipe 63: Incremental refinement using Delta Live Tables
    Recipe 64: Incremental refinement using PySpark
    Full refinement
    Recipe 65: Full update refinement using Delta Live Tables
    Recipe 66: Full refinement using PySpark
    Data quality rules
    Recipe 67: Using expectations in DLT with SQL
    Recipe 68: Using expectations in DLT with PySpark
    Silver to silver
    Reshaping projection
    Recipe 69: Projection reshaping using Python
    Recipe 70: Projection reshaping using Delta Live Tables
    Splitting tables
    Recipe 71: Splitting table into multiple in PySpark
    Recipe 72: Splitting table into multiple in Delta Live Tables
    Enrichment
    Recipe 73: Creating lookup data from telemetry
    Recipe 74: Combining tables using DLT
    Conclusion

8. Transforming to Create Gold for Business Purposes
    Introduction
    Structure
    Objectives
    Silver to gold
    Aggregation
    Recipe 75: Aggregation in Delta Live Tables
    Dimensional tables using PySpark
    Recipe 76: Creating a time dimension
    Recipe 77: Creating a dimension from telemetry
    Recipe 78: Creating a fact table from telemetry
    Dimensional tables in Delta Live Tables
    Recipe 79: Dimensional models with Delta Live Tables
    Using Common Data Models with Delta Live Tables
    Microsoft Common Data Model
    Gold to gold
    Table optimization for consumption
    Optimize
    Recipe 80: Manually optimize a table
    Vacuum
    Recipe 81: Vacuum a Delta table
    Conclusion

9. Machine Learning and Data Science
    Introduction
    Structure
    Objectives
    Machine Learning in Databricks
    Using AutoML
    Recipe 82: Creating an ML cluster
    Recipe 83: Importing data with the Databricks web page
    Recipe 84: Creating and running an AutoML experiment
    Setting up and using MLflow
    Recipe 85: Setting up an MLflow experiment
    Recipe 86: Using MLflow for non-ML workflows
    Deploying models to production
    Recipe 87: Registering a model
    Recipe 88: Using a model for inference
    Using Databricks feature store
    Recipe 89: Importing an HTML notebook
    Recipe 90: Basic interaction with Databricks Feature Store
    Conclusion

10. SQL Analysis
    Introduction
    Structure
    Objectives
    Databricks SQL
    Creating and managing a SQL Warehouse
    Recipe 91: Creating a SQL Warehouse
    Recipe 92: Connect to a SQL Warehouse from a Python Jupyter Notebook
    Using the SQL Editor
    Writing queries
    Common interview queries
    Recipe 93: Show the contents of a table
    Recipe 94: Select with filtered ordered limited result
    Recipe 95: Aggregation of records
    Recipe 96: Using grouping to find duplicate records
    Recipe 97: Generating synthetic data
    Recipe 98: Calculate rollups
    Recipe 99: Types of joins
    Inner
    Left and right outer joins
    Full outer join
    Cross join
    Creating dashboards
    Recipe 100: Creating a quick dashboard
    Recipe 101: Schedule dashboard refresh
    Setting alerts
    Recipe 102: Create a query for an alert
    Recipe 103: Create an alert
    Cost and performance considerations
    Conclusion

11. Graph Analysis
    Introduction
    Structure
    Objectives
    What is a graph
    When to use graph operations
    GraphX
    GraphFrames
    Recipe 104: Creating a GraphFrame
    Recipe 105: Using example graphs
    Graph operations and algorithms
    Recipe 106: Breadth-first search
    Recipe 107: PageRank
    Recipe 108: Shortest path
    Recipe 109: Connected components
    Recipe 110: Strongly connected components
    Recipe 111: Label Propagation Algorithm
    Recipe 112: Motif finding
    Neo4J and Databricks
    Recipe 113: Using AuraDB
    Recipe 114: Reading Neo4J’s AuraDB from Databricks
    Conclusion

12. Visualizations
    Introduction
    Structure
    Objectives
    Visualization best practices
    Visually appealing
    Keep it simple
    Explain unfamiliar graph types
    Follow conventions
    Tell a story
    Databricks dashboards
    Recipe 115: Importing sample dashboards
    Recipe 116: Data preparation for a new dashboard
    Recipe 117: Creating a dashboard
    Visualizations in Databricks notebooks
    Recipe 118: Using visualizations in notebooks
    Power BI
    Recipe 119: Connecting Power BI to Databricks
    Conclusion

13. Governance
    Introduction
    Structure
    Objectives
    Role of data governance
    Using Unity Catalog
    Recipe 120: Configuring Unity Catalog in Azure
    Creating storage
    Create a managed identity
    Create Access Connector for Azure Databricks
    Grant managed identity access
    Creating a metastore
    Unity Catalog object model
    Recipe 121: Creating a new catalog
    Recipe 122: Uploading data
    Recipe 123: Creating a table
    Installing and using Purview
    Recipe 124: Installing Purview
    Recipe 125: Connecting Purview to Databricks
    Recipe 126: Scanning a Databricks workspace
    Recipe 127: Browsing the Data Catalog
    Conclusion

14. Operations
    Introduction
    Structure
    Objectives
    Source code management and orchestration
    Recipe 128: Use GitHub with Databricks
    Recipe 129: Create workflows to orchestrate processing
    Recipe 130: Saving a Job JSON
    Recipe 131: Use Airflow to coordinate processing
    Scheduled and ongoing maintenance
    Recipe 132: Repairing damaged tables
    Recipe 133: Vacuum unneeded data
    Recipe 134: Optimize Delta tables
    Cost management
    Recipe 135: Use cluster policies
    Recipe 136: Using tags to monitor costs
    Conclusion

15. Tips, Tricks, Troubleshooting, and Best Practices
    Introduction
    Structure
    Objectives
    Ingesting relational data with Databricks
    Recipe 137: Loading data from MySQL
    Recipe 138: Extending a Python class and reading using Databricks runtime format
    Recipe 139: Caching DataFrames
    Recipe 140: Loading data from MySQL using workers
    Performance optimization
    Using Databricks event log
    Exploring the Spark UI jobs tab
    Using the Spark UI SQL/DataFrame tab
    Recipe 141: Using pools to improve performance
    Programmatic deployment and interaction
    Recipe 142: Creating a workspace with ARM Template
    Recipe 143: Using the Databricks API
    Reading a Kafka stream
    Recipe 144: Creating a Kafka cluster
    Recipe 145: Using Confluent Cloud
    Notebook orchestration
    Recipe 146: Running a notebook with parameters
    Recipe 147: Conditional execution of notebooks
    Best practices
    Organize data assets by source until silver
    Use automation as much as possible
    Use version control
    Keep each step of the process simple
    Do not be afraid to change
    Conclusion

Index