Amazon Redshift: The Definitive Guide: Jump-Start Analytics Using Cloud Data Warehousing
9781098135300
Amazon Redshift powers analytic cloud data warehouses worldwide, from startups to some of the largest enterprise data wa
English
Pages 456
Year 2023
Table of contents:
Foreword
Preface
Conventions Used in This Book
Using Code Examples
O’Reilly Online Learning
How to Contact Us
Acknowledgments
1. AWS for Data
Data-Driven Organizations
Business Use Cases
New Business Use Cases with Generative AI
Modern Data Strategy
Comprehensive Set of Capabilities
Integrated Set of Tools
End-to-End Data Governance
Modern Data Architecture
Role of Amazon Redshift in a Modern Data Architecture
Real-World Benefits of Adopting a Modern Data Architecture
Reference Architecture for Modern Data Architecture
Data Sourcing
Extract, Transform, and Load
Storage
Storage in the data warehouse
Storage in the data lake
Analysis
Comparing transactional databases, data warehouses, and data lakes
Data Mesh and Data Fabric
Data Mesh
Data Fabric
Summary
2. Getting Started with Amazon Redshift
Amazon Redshift Architecture Overview
Get Started with Amazon Redshift Serverless
Creating an Amazon Redshift Serverless Data Warehouse
Sample Data
Activate Sample Data Models and Query Using the Query Editor
When to Use a Provisioned Cluster?
Creating an Amazon Redshift Provisioned Cluster
Estimate Your Amazon Redshift Cost
Amazon Redshift Managed Storage
Amazon Redshift Serverless Compute Cost
Setting a different value for the base capacity
High/frequent usage
Amazon Redshift Provisioned Compute Cost
AWS Account Management
Connecting to Your Amazon Redshift Data Warehouse
Private/Public VPC and Secure Access
Stored Password
Temporary Credentials
Federated User
SAML-Based Authentication from an Identity Provider
Native IdP Integration
Amazon Redshift Data API
Querying a Database Using the Query Editor V2
Federated user
Temporary credentials
Database username and password
AWS Secrets Manager
Business Intelligence Using Amazon QuickSight
Connecting to Amazon Redshift Using JDBC/ODBC
Summary
3. Setting Up Your Data Models and Ingesting Data
Data Lake First Versus Data Warehouse First Strategy
Data Lake First Strategy
Data Warehouse First Strategy
Deciding On a Strategy
Defining Your Data Model
Database Schemas, Users, and Groups
Star Schema, Denormalized, Normalized
Student Information Learning Analytics Dataset
Create Data Models for Student Information Learning Analytics Dataset
Load Batch Data into Amazon Redshift
Using the COPY Command
Ingest Data for the Student Learning Analytics Dataset
Building a Star Schema
Continuous File Ingestion from Amazon S3
Using AWS Glue for Transformations
Manual Loading Using SQL Commands
Using the Query Editor V2
Load Real-Time and Near Real-Time Data
Near Real-Time Replication Using AWS Database Migration Service
Amazon Aurora Zero-ETL Integration with Amazon Redshift
Using Amazon AppFlow
Streaming Ingestion
Steps to get started with streaming ingestion
Important considerations and best practices
Optimize Your Data Structures
Automatic Table Optimization and Autonomics
Distribution Style
Sort Key
Compression Encoding
Summary
4. Data Transformation Strategies
Comparing ELT and ETL Strategies
In-Database Transformation
Semistructured Data
User-Defined Functions
Stored Procedures
Scheduling and Orchestration
Access All Your Data
External Amazon S3 Data
External Operational Data
External Amazon Redshift Data
External Transformation
AWS Glue
Register Amazon Redshift target connection
Build and run your AWS Glue job
Summary
5. Scaling and Performance Optimizations
Scale Storage
Autoscale Your Serverless Data Warehouse
Scale Your Provisioned Data Warehouse
Evolving Compute Demand
Predictable workload changes
Unpredictable Workload Changes
WLM, Queues, and QMR
Queue Assignment
Short Query Acceleration
Query Monitoring Rules
Automatic WLM
Manual WLM
Parameter Group
WLM Dynamic Memory Allocation
Materialized Views
Autonomics
Auto Table Optimizer and Smart Defaults
Auto Vacuum
Auto Vacuum Sort
Auto Analyze
Auto Materialized Views (AutoMV)
Amazon Redshift Advisor
Workload Isolation
Additional Optimizations for Achieving the Best Price and Performance
Database Versus Data Warehouse
Amazon Redshift Serverless
Multi-Warehouse Environment
AWS Data Exchange
Table Design
Indexes Versus Zone Maps
Drivers
Simplify ETL
Query Editor V2
Query Tuning
Query Processing
Query planning and execution workflow
Query stages and system tables
Understanding the query plan
Factors affecting query performance
Analyzing Queries
Reviewing query alerts
Analyzing the query plan
Identifying Queries for Performance Tuning
Summary
6. Amazon Redshift Machine Learning
Machine Learning Cycle
Amazon Redshift ML
Amazon Redshift ML Flexibility
Getting Started with Amazon Redshift ML
Machine Learning Techniques
Supervised Learning Techniques
Unsupervised Learning Techniques
Machine Learning Algorithms
Integration with Amazon SageMaker Autopilot
Create Model
Label Probability
Explain Model
Using Amazon Redshift ML to Predict Student Outcomes
Amazon SageMaker Integration with Amazon Redshift
Integration with Amazon SageMaker—Bring Your Own Model (BYOM)
BYOM Local
BYOM Remote
Amazon Redshift ML Costs
Summary
7. Collaboration with Data Sharing
Amazon Redshift Data Sharing Overview
Data Sharing Use Cases
Key Concepts of Data Sharing
How to Use Data Sharing
Sharing Data Within the Same Account
Sharing Data Across Accounts Using Cross-Account Data Sharing
Analytics as a Service Use Case with Multi-Tenant Storage Patterns
Scaling Your Multi-tenant Architecture Using Data Sharing
Multi-tenant Storage Patterns Using Data Sharing
Pool model
Creating database views in the producer
Creating datashares in producer and granting usage to the consumer
Using Role-Level Security
Bridge model
Creating database schemas and tables in the producer
Creating datashares in the producer and granting usage to the consumer
Silo model
Creating databases and datashares in the producer
Creating datashares in the producer and granting usage to the consumer
External Data Sharing with AWS ADX Integration
Publishing a Data Product
Subscribing to a Published Data Product
Considerations When Using AWS Data Exchange for Amazon Redshift
Query from the Data Lake and Unload to the Data Lake
Amazon DataZone to Discover and Share Data
Use Cases for a Data Mesh Architecture with Amazon DataZone
Key Capabilities and Use Cases for Amazon DataZone
Amazon DataZone Integrations with Amazon Redshift and Other AWS Services
Components and Capabilities of Amazon DataZone
Business data catalog
Projects
Data governance and access control
Data portal
Getting Started with Amazon DataZone
Step 1: Create the domain and data portal
Step 2: Create a producer project
Step 3: Produce data for publishing in Amazon DataZone
Step 4: Publish a data product to the catalog
Step 5: Create a consumer project
Step 6: Discovering and consuming data in Amazon DataZone
Step 7: Approve access to a published data asset as a producer
Step 8: Analyze a published data asset as a consumer
Security in Amazon DataZone
Using Lake Formation-based authorization
Encryption
Implement least privilege access
Use IAM roles
Summary
8. Securing and Governing Data
Object-Level Access Controls
Object Ownership
Default Privileges
Public Schema and Search Path
Access Controls in Action
Database Roles
Database Roles in Action
Row-Level Security
Row-Level Security in Action
Row-Level Security Considerations
Dynamic Data Masking
Dynamic Data Masking in Action
Dynamic Data Masking Considerations
External Data Access Control
Associate IAM Roles
Authorize Assume Role Privileges
Establish External Schemas
Lake Formation for Fine-Grained Access Control
Summary
9. Migrating to Amazon Redshift
Migration Considerations
Retire Versus Retain
Migration Data Size
Platform-Specific Transformations Required
Data Volatility and Availability Requirements
Selection of Migration and ETL Tools
Data Movement Considerations
Domain Name System (DNS)
Migration Strategies
One-Step Migration
Two-Step Migration
Initial data migration
Changed data migration
Iterative Migration
Migration Tools and Services
AWS Schema Conversion Tool
SCT overview
SCT migration assessment report
SCT data extraction agents
Migrating BLOBs to Amazon Redshift
Data Warehouse Migration Service
How AWS DMS works
DMS replication instances
DMS replication validation
AWS Snow Family
AWS Snow Family key features
AWS Snow Family devices
AWS Snowball Edge Client
Database Migration Process
Step 1: Convert Schema and Subject Area
Step 2: Initial Data Extraction and Load
Step 3: Incremental Load Through Data Capture
Amazon Redshift Migration Tools Considerations
Accelerate Your Migration to Amazon Redshift
Macro Conversion
Case-Insensitive String Comparison
Recursive Common Table Expressions
Proprietary Data Types
Summary
10. Monitoring and Administration
Amazon Redshift Monitoring Overview
Monitoring
Troubleshooting
Optimization
Monitoring Using Console
Monitoring and Administering Serverless
Query and database monitoring serverless
Serverless query and database monitoring
Serverless query monitoring drill-down query
Serverless query monitoring drill-down query plan
Serverless query monitoring drill-down related metrics
Resource monitoring
Monitoring Provisioned Data Warehouse Using Console
Data warehouse performance and resource utilization metrics
View Performance Data
CPU utilization
Percentage disk space used
Database connections
Query duration
Query throughput
Query and data ingestion performance metrics: Query Monitoring tab
Query history at data warehouse level
Database performance for queries
Workload concurrency
Monitoring Queries and Loads Across Clusters
Monitoring queries and loads
Monitoring top queries
Identifying Systemic Query Performance Problems
Monitoring Using Amazon CloudWatch
Amazon Redshift CloudWatch Metrics
Monitoring Using System Tables and Views
Monitoring Serverless Using System Views
High Availability and Disaster Recovery
Recovery Time Objective and Recovery Point Objective Considerations
Multi-AZ Compared to Single-AZ Deployment
Creating or Converting a Provisioned Data Warehouse with Multi-AZ Configuration
Creating a new data warehouse with Multi-AZ option
Migrating an existing data warehouse from Single-AZ to Multi-AZ
Auto Recovery of Multi-AZ Deployment
Snapshots, Backup, and Restore
Snapshots for Backup
Automated Snapshots
Manual Snapshots
Disaster Recovery Using Cross-Region Snapshots
Using Snapshots for Simple-Replay
Monitoring Amazon Redshift Using CloudTrail
Bring Your Own Visualization Tool to Monitor Amazon Redshift
Monitor Operational Metrics Using System Tables and Amazon QuickSight
Monitor Operational Metrics Using Grafana Plug-in for Amazon Redshift
Summary
Index