Big Data and Hadoop: Fundamentals, tools, and techniques for data-driven success - 2nd Edition [2 ed.] 9789355516664

In today's data-driven world, harnessing the power of big data is no longer a luxury, but a necessity. This compre

English, 548 pages, 2023

Table of contents:
Cover
Title Page
Copyright Page
Dedication Page
About the Author
About the Reviewer
Acknowledgement
Preface
Table of Contents
1. Big Data Introduction and Demand
Introduction
Structure
Objectives
Big data
Characteristics of big data
Why big data is required
Hadoop
History of Hadoop
Name of Hadoop
Hadoop ecosystem
Convergence of key trends
Convergence of big data into business
Big data versus other techniques
Unstructured data
Mining unstructured data
Unstructured data and large data
Implementing unstructured data management
Industry examples of big data
Use of big data: Hadoop at Yahoo
Rackspace for log processing
Hadoop at Facebook
Usages of big data
Machine learning tools
Entertainment fields
Web analytics
Big data and marketing
Big data and fraud
Risk management in big data with credit card
Big data and algorithm trading
Big data in health care
Conclusion
2. NoSQL Data Management
Introduction
Structure
Objectives
Terminology used in NoSQL and RDBMS
Database used in NoSQL
Key–value database
Document database
Apache CouchDB
MongoDB
Column family database
Components
Table representation of Google Bigtable
Bigtable derivatives
Graph database
Neo4j
GraphQL
SQL versus NoSQL
Denormalization
Data distribution
Data durability
Consistency issues in NoSQL
ACID versus BASE
Relaxing consistency
HBase
Installation of HBase
History of HBase
HBase data structure
Physical storage
Components
HBase shell commands
Different usages of the scan command
Terminologies
Version stamp
Region
Locking
Conclusion
3. MapReduce Technique
Introduction
Structure
Objectives
MapReduce architecture
MapReduce datatype
File input format
Java MapReduce
Partitioner and combiner
Example of MapReduce
Situation for partitioner and combiner
Use of combiner
Composing MapReduce calculations
Conclusion
4. Basics of Hadoop
Introduction
Structure
Objectives
Data distribution
Data format
Analyzing data with Hadoop
Scale-in versus scale-out
Number of reducers used
Driver class with no reducer
Hadoop streaming
Streaming in Ruby
Streaming in Python
Streaming in Java
Hadoop pipes
Design of HDFS
Very large
Streaming data access
Commodity hardware
Low-latency data access
Lots of small files
Arbitrary file modifications
HDFS concept
Blocks
Namenodes and DataNodes
HDFS group
All-time availability
Hadoop file system
Java interface
HTTP
APIs in C language
Filesystem in Userspace
Reading data using the Java interface
Reading data using Java interface (FileSystem API)
Data flow
File read
File write
Coherency model
Cluster balance
Hadoop archive
Hadoop I/O
Data integrity
Local file system
Compression
Codecs
Compression and input splits
Map output
Serialization
Avro file-based data structure
Data type and schemas
Serialization and deserialization
Avro MapReduce
Conclusion
5. Hadoop Installation
Introduction
Structure
Objectives
Using standalone (local) mode
VMware
On Ubuntu 16.04
Fully distributed mode
Installation and configuration of multi-node cluster
Conclusion
6. MapReduce Applications
Introduction
Structure
Objectives
Understanding MapReduce
Traditional way
MapReduce workflow
Map side
Reduce side
Sample program using MapReduce
Introduction of Web UI
Debugging MapReduce job
Job chaining and job control
Anatomy of MapReduce job
Anatomy of file write
Anatomy of file read
MapReduce job run
Classic MapReduce: MapReduce 1
Failure in MapReduce1
MapReduce2 YARN
Failure in MapReduce 2
Conclusion
7. Hadoop Related Tools-I: HBase and Cassandra
Introduction
Structure
Objectives
Installation of HBase
Conceptual architecture
Regions and region server
Master Server
Locking
Implementation
HBase versus RDBMS
HBase client
Class HTable
Class Put
Class Get
Class Delete
Class Result
HBase examples and commands
HBase using Java APIs
Creating a table
List of the tables in HBase
Disable a table
Add column family
Deleting column family
Verifying the existence of the table
Deleting table
Disabling table
Stopping HBase
Challenges
Cassandra
CAP theorem
Explanation in terms of intersection points
Characteristics of Cassandra
Installing Cassandra
Basic CLI commands
Cassandra data model
Super column family
Clusters
Keyspaces
Column families
Super columns
Cassandra examples
Creating a keyspace
Alter keyspace
Dropping a keyspace
Create table
Primary key
Alter table
Truncate table
Executing batch
Delete entire row
Describe
Cassandra client
Thrift
Avro
Hector
Hadoop integration
Use cases
eBay
Hulu
Conclusion
8. Hadoop Related Tools-II: PigLatin and HiveQL
Introduction
Structure
Objectives
Apache PigLatin
Installation
Execution type
Local mode
MapReduce mode
The platform for running Pig programs
Script
Grunt
Embedded
Grunt Shell
Example
Commands in Grunt
Pig data model
Scalar
Complex
PigLatin
Input and output
Store
Relational operations
Examples
User-defined functions
Developing and testing the PigLatin script
Dump operator
Describe operator
Explanation operator
Illustration operator
Hive
Installing Hive
Hive architecture
Hive services
Data type and file format
Comparison of HiveQL with traditional database
Schema on read versus write
Update, transactions and indexes
HiveQL
Data definition language
Data manipulation language
Conclusion
9. Practical and Research-based Topics
Introduction
Structure
Objectives
Data analysis with X
Using Flume
Using MapReduce
Use of Bloom filter in MapReduce
The function of the Bloom filter
Working of Bloom filter
Application of Bloom filter
Implementation of Bloom filter in MapReduce
Amazon Web Service
Setting up AWS
Setting up Hadoop on EC2
Examples of data analysis
Document archived from NY Times
Data mining in mobiles
Hadoop diagnosis
System’s health
Setting permission
Managing quotas
Enabling trash
Removing DataNode
Conclusion
10. Spark
Introduction
Structure
Objectives
Spark programming model
Record linkage
Spark shell
Scala programming model
Features of Scala
Work on Scala
Resilient Distributed Dataset
Spark methods for data processing
Aggregate
Cartesian
Checkpoint
Repartition
Cogroup
Collect
CollectAsMap
CombineByKey
Compute
Count
CountByKey
CountByValue
countApproxDistinct
Dependencies
Distinct
first
filter and filterWith
filter Transformation
fold
foreach
getStorageLevel
groupBy
Histogram
id
join
leftOuterJoin
Example of programs using Scala
Shuffling
Common Spark memory issues
Conclusion
Index

Author / Uploaded
Mayank Bhushan
