Practical Hadoop Migration: How to Integrate Your RDBMS with the Hadoop Ecosystem and Re-Architect Relational Applications to NoSQL [1 ed.]
1484212886, 9781484212882
Re-architect relational applications to NoSQL, integrate relational database management systems with the Hadoop ecosyste
English
Pages 329 [321]
Year 2016
Table of contents:
Contents at a Glance
Contents
Foreword
About the Author
About the Technical Reviewer
Acknowledgments
Introduction
Chapter 1: RDBMS Meets Hadoop: Integrating, Re-Architecting, and Transitioning
Conceptual Differences Between Relational and HDFS NoSQL Databases
Relational Design and Hadoop in Conjunction: Advantages and Challenges
Type of Data
Data Volume
Business Need
Deciding to Integrate, Re-Architect, or Transition
Type of Data
Type of Application
Business Objectives
How to Integrate, Re-Architect, or Transition
Integration
Re-Architecting Using Lambda Architecture
Batch Layer
Serving Layer
Speed Layer
Transition to Hadoop/NoSQL
Type of Data
Data Volume
Data Distribution
Migrating the Data
Summary
Part I: Relational Database Management Systems: A Review of Design Principles, Models and Best Practices
Chapter 2: Understanding RDBMS Design Principles
Overview of Design Methodologies
Top-down
Bottom-up
SSADM
Exploring Design Methodologies
Top-down
Bottom-up
SSADM
Feasibility Study
Investigation of the Current Environment
Business System Options
Requirements Specification
Technical System Options
Logical Design
Physical Design
Pros and Cons of SSADM
Components of Database Design
Normal Forms
First Normal Form
Second Normal Form
Third Normal Form
Keys in Relational Design
Optionality and Cardinality
Supertypes and Subtypes
Summary
Chapter 3: Using SSADM for Relational Design
Feasibility Study
Project Initiation Plan
Requirements and User Catalogue
Requirements Catalogue
User Catalogue
Current Environment Description
Current System Description
Current Physical Data Flow Model
Current Logical Data Model
Proposed Environment Description
Business Activity Model
Data Specification
Function Specification
Problem Definition
Feasibility Study Report
Requirements Analysis
Investigation of Current Environment
Current Data Flow Model
Current Logical Data Model
Requirements Catalogue
User Catalogue
Logical Data Store/Entity Cross-Reference
Logical View of Current Services and System Scope
Business System Options
Requirements Specification
Data Flow Model
Logical Data Model
Function Definitions
GetPlayerInjuryInfo
GetPlayerChronicCondInfo
GetPlayerContractDetails
GetPlayerScheduleInfo
CalculateLossOfPlayPremium
EvalLossOfPlayClaim
Effect Correspondence Diagrams (ECDs)
Entity Life Histories (ELHs)
Logical System Specification
Technical Systems Options
Logical Design
Update Processing Model
Enquiry Processing Model
Data Catalogue
Physical Design
Logical to Physical Transformation
Space Estimation Growth Provisioning
Optimizing Physical Design
Summary
Chapter 4: RDBMS Design and Implementation Tools
Database Design Tools
CASE tools
Building and Using Design Layers
Categorizing Design Using Subject Areas
Display Level of a Model
Forward and Reverse Engineering
Creating Reusable Components
Propagating a Change Easily and Quickly
Diagramming Tools
Administration and Monitoring Applications
Database Administration or Management Applications
Monitoring Applications
Summary
Part II: Hadoop: A Review of the Hadoop Ecosystem, NoSQL Design Principles and Best Practices
Chapter 5: The Hadoop Ecosystem
Query Tools
Spark SQL
Presto
Analytic Tools
Apache Kylin
Kylin Architecture
In-Memory Processing Tools
Flink
Flink Architecture
Search and Messaging Tools
Summary
Chapter 6: Re-Architecting for NoSQL: Design Principles, Models and Best Practices
Design Principles for Re-Architecting Relational Applications to NoSQL Environments
Selecting an Appropriate NoSQL Database
Key-Value Stores
Document Databases
Columnar Databases
Graph Databases
Domain Description
Nodes
Labels
Relationships
Creating Attributes
Concurrency and Security for NoSQL
Concurrency
Security
Designing the Transition Model
Denormalization of Relational (OLTP) Data
Denormalization of Relational (OLAP) Data
Implementing the Final Model
Columnar Database as a NoSQL Target
Document Database as a NoSQL Target
Best Practices for NoSQL Re-Architecture
Summary
Part III: Integrating Relational Database Management Systems with the Hadoop Distributed File System
Chapter 7: Data Lake Integration Design Principles
Data Lake vs. Data Warehouse
Data Warehouse
Data Lake
Concept of a Data Lake
Data Reservoirs
Data Reservoir Repositories
Data Reservoir Services
Governance Engine
Authentication
Authorization
PII Masking
Encryption
Encryption at Rest
Encryption in Transit
Data Quality Services
Data Cleansing
Matching
Data Profiling
Factors for a Successful Implementation
Exploratory Lakes
Data Validation for Exploratory Analysis
Exploratory Analysis Through Visualizations
Correlation
Clustering
Hierarchical Clustering
K-means Clustering
Factors for a Successful Implementation
Analytical Lakes
Using Data for Analytical Models
Model Building Steps
Using Data as a Staging Area for EDW or Data Mart
Real-Time Processing and Analytics
Event Stream Processing
Complex Event Processing
Factors for a Successful Implementation
Summary
Chapter 8: Implementing SQOOP and Flume-based Data Transfers
Deciding on an ETL Tool
Sqoop vs. Flume
Processing Streaming Data
Spark and Spark Streaming
Storm
Samza
Using SQOOP for Data Transfer
Using Flume for Data Transfer
Flume Architecture
Understanding and Using Flume Components
Source
Sink
Implementing Log Consolidation Using Flume
Summary
Part IV: Transitioning from Relational to NoSQL Design Models
Chapter 9: Lambda Architecture for Real-time Hadoop Applications
Defining and Using the Lambda Layers
Batch Layer
Designing Your Master Data
Fact-Based Model
Applying a Fact-based Model to Relational Applications
Building Batch Views
Designing Batch Views for Your Fact-based Model
Implementing Batch Views
Serving Layer
ElephantDB
Splout SQL
Speed Layer
Pros and Cons of Using Lambda
Benefits of Lambda
Issues with Lambda
The Kappa Architecture
Future Architectures
A Bit of History
Butterfly Architecture
Storage for Butterfly Architecture
Ampool
Example Use Case: Ad Tech Data Pipeline
The Data
User Profiles
Advertisements
Content Metadata
Ad Serving Logs
Computations
Ingestion and Streaming Analytics
Batch Model Building
Interactive and Ad Hoc SQL Queries
Summary
Chapter 10: Implementing and Optimizing the Transition
Hardware Configuration
Cluster Configuration
Operating System Configuration
Hadoop Configuration
HDFS Configuration
JVM/YARN/MapReduce Configuration
Generic JVM Guidelines
Generic YARN/MapReduce Guidelines
Optimizing MapReduce Applications
Optimizing YARN Execution
Choosing an Optimal File Format
Row-based Formats
Text Files
Sequence Files
Avro
Column-based Formats
RCFile
ORCFile
Parquet
Indexing Considerations for Performance
Compact Indexes
Bitmap Indexes
Choosing a NoSQL Solution and Optimizing Your Data Model
Summary
Part V: Case Study for Designing and Implementing a Hadoop-based Solution
Chapter 11: Case Study: Implementing Lambda Architecture
The Business Problem and Solution
Solution Design
Hardware
Software
Database Design
Considering a Fact-based Model
Data Conditions for Fraudulence
Batch Layer Design
Implementing Batch Layer
Implementing the Serving Layer
Implementing the Speed Layer
Storage Structures (for Master Data and Views)
Other Performance Considerations
Reference Architectures
Changes to Implementation for Latest Architectures
Re-Implementation Using Kappa Architecture
Changes for Fast Data Architecture
Changes for Butterfly Architecture
Summary
Index