Big Data and Hadoop: Fundamentals, tools, and techniques for data-driven success - 2nd Edition [2 ed.] 9789355516664

In today's data-driven world, harnessing the power of big data is no longer a luxury, but a necessity. This compre

English Pages 548 Year 2023

Table of contents :
Cover
Title Page
Copyright Page
Dedication Page
About the Author
About the Reviewer
Acknowledgement
Preface
Table of Contents
1. Big Data Introduction and Demand
Introduction
Structure
Objectives
Big data
Characteristics of big data
Why big data is required
Hadoop
History of Hadoop
Name of Hadoop
Hadoop ecosystem
Convergence of key trends
Convergence of big data into business
Big data versus other techniques
Unstructured data
Mining unstructured data
Unstructured data and large data
Implementing unstructured data management
Industry examples of big data
Use of big data: Hadoop at Yahoo
Rackspace for log processing
Hadoop at Facebook
Usages of big data
Machine learning tools
Entertainment fields
Web analytics
Big data and marketing
Big data and fraud
Risk management in big data with credit card
Big data and algorithm trading
Big data in health care
Conclusion
2. NoSQL Data Management
Introduction
Structure
Objectives
Terminology used in NoSQL and RDBMS
Database used in NoSQL
Key–value database
Document database
Apache CouchDB
MongoDB
Column family database
Components
Table representation of Google Bigtable
BigTable derivatives
Graph database
Neo4j
GraphQL
SQL versus NoSQL
Denormalization
Data distribution
Data durability
Consistency issues in NoSQL
ACID versus BASE
Relaxing consistency
HBase
Installation of HBase
History of HBase
HBase data structure
Physical storage
Components
HBase shell commands
Different usages of the scan command
Terminologies
Version stamp
Region
Locking
Conclusion
3. MapReduce Technique
Introduction
Structure
Objectives
MapReduce architecture
MapReduce datatype
File input format
Java MapReduce
Partitioner and combiner
Example of MapReduce
Situation for partitioner and combiner
Use of combiner
Composing MapReduce calculations
Conclusion
4. Basics of Hadoop
Introduction
Structure
Objectives
Data distribution
Data format
Analyzing data with Hadoop
Scale-in versus scale-out
Number of reducers used
Driver class with no reducer
Hadoop streaming
Streaming in Ruby
Streaming in Python
Streaming in Java
Hadoop pipes
Design of HDFS
Very large
Streaming data access
Commodity hardware
Low-latency data access
Lots of small files
Arbitrary file modifications
HDFS concept
Blocks
Namenodes and DataNodes
HDFS group
All-time availability
Hadoop file system
Java interface
HTTP
APIs in C language
Filesystem in Userspace
Reading data using the Java interface
Reading data using Java interface (FileSystem API)
Data flow
File read
File write
Coherency model
Cluster balance
Hadoop archive
Hadoop I/O
Data integrity
Local file system
Compression
Codecs
Compression and input splits
Map output
Serialization
Avro file-based data structure
Data type and schemas
Serialization and deserialization
Avro MapReduce
Conclusion
5. Hadoop Installation
Introduction
Structure
Objectives
Using standalone (local) mode
VMware
On Ubuntu 16.04
Fully distributed mode
Installation and configuration of multi-node cluster
Conclusion
6. MapReduce Applications
Introduction
Structure
Objectives
Understanding MapReduce
Traditional way
MapReduce workflow
Map side
Reduce side
Sample program using MapReduce
Introduction of Web UI
Debugging MapReduce job
Job chaining and job control
Anatomy of MapReduce job
Anatomy of file write
Anatomy of file read
MapReduce job run
Classic MapReduce: MapReduce 1
Failure in MapReduce1
MapReduce2 YARN
Failure in MapReduce 2
Conclusion
7. Hadoop Related Tools-I: HBase and Cassandra
Introduction
Structure
Objectives
Installation of HBase
Conceptual architecture
Regions and region server
Master Server
Locking
Implementation
HBase versus RDBMS
HBase client
Class HTable
Class Put
Class Get
Class Delete
Class Result
HBase examples and commands
HBase using Java APIs
Creating a table
List of the tables in HBase
Disable a table
Add column family
Deleting column family
Verifying the existence of the table
Deleting table
Disabling table
Stopping HBase
Challenges
Cassandra
CAP theorem
Explanation in terms of intersection points
Characteristics of Cassandra
Installing Cassandra
Basic CLI commands
Cassandra data model
Super column family
Clusters
Keyspaces
Column families
Super columns
Cassandra examples
Creating a keyspace
Alter keyspace
Dropping a keyspace
Create table
Primary key
Alter table
Truncate table
Executing batch
Delete entire row
Describe
Cassandra client
Thrift
Avro
Hector
Hadoop integration
Use cases
eBay
Hulu
Conclusion
8. Hadoop Related Tools-II: PigLatin and HiveQL
Introduction
Structure
Objectives
Apache PigLatin
Installation
Execution type
Local mode
MapReduce mode
The platform for running Pig programs
Script
Grunt
Embedded
Grunt Shell
Example
Commands in grunt
Pig data model
Scalar
Complex
PigLatin
Input and output
Store
Relational operations
Examples
User-defined functions
Developing and testing the PigLatin script
Dump operator
Describe operator
Explanation operator
Illustration operator
Hive
Installing Hive
Hive architecture
Hive services
Data type and file format
Comparison of HiveQL with traditional database
Schema on read versus write
Update, transactions and indexes
HiveQL
Data definition language
Data manipulation language
Conclusion
9. Practical and Research-based Topics
Introduction
Structure
Objectives
Data analysis with X
Using flume
Using MapReduce
Use of Bloom filter in MapReduce
The function of the Bloom filter
Working of Bloom filter
Application of Bloom filter
Implementation of Bloom filter in MapReduce
Amazon Web Service
Setting up AWS
Setting up Hadoop on EC2
Examples of data analysis
Document archived from NY Times
Data mining in mobiles
Hadoop diagnosis
System’s health
Setting permission
Managing quotas
Enabling trash
Removing DataNode
Conclusion
10. Spark
Introduction
Structure
Objectives
Spark programming model
Record linkage
Spark shell
Scala programming model
Features of Scala
Work on Scala
Resilient Distributed Dataset
Spark methods for data processing
Aggregate
Cartesian
Checkpoint
Repartition
Cogroup
Collect
CollectAsMap
CombineByKey
Compute
Count
CountByKey
CountByValue
countApproxDistinct
Dependencies
Distinct
first
filter and filterWith
filter Transformation
fold
foreach
getStorageLevel
groupBy
Histogram
id
join
leftOuterJoin
Example of programs using Scala
Shuffling
Common Spark memory issues
Conclusion
Index
