Table of contents:
Cover Title Page Copyright Page Dedication Page About the Author About the Reviewer Acknowledgement Preface Table of Contents
1. Big Data Introduction and Demand: Introduction Structure Objectives Big data Characteristics of big data Why big data is required Hadoop History of Hadoop Name of Hadoop Hadoop ecosystem Convergence of key trends Convergence of big data into business Big data versus other techniques Unstructured data Mining unstructured data Unstructured data and large data Implementing unstructured data management Industry examples of big data Use of big data: Hadoop at Yahoo RackSpace for log processing Hadoop at Facebook Usages of big data Machine learning tools Entertainment fields Web analytics Big data and marketing Big data and fraud Risk management in big data with credit card Big data and algorithm trading Big data in health care Conclusion
2. NoSQL Data Management: Introduction Structure Objectives Terminology used in NoSQL and RDBMS Database used in NoSQL Key–value database Document database Apache CouchDB MongoDB Column family database Components Table representation of Google Bigtable Bigtable derivatives Graph database Neo4j GraphQL SQL versus NoSQL Denormalization Data distribution Data durability Consistency issues in NoSQL ACID versus BASE Relaxing consistency HBase Installation of HBase History of HBase HBase data structure Physical storage Components HBase shell commands Different usages of the scan command Terminologies Version stamp Region Locking Conclusion
3. MapReduce Technique: Introduction Structure Objectives MapReduce architecture MapReduce datatype File input format Java MapReduce Partitioner and combiner Example of MapReduce Situation for partitioner and combiner Use of combiner Composing MapReduce calculations Conclusion
4. Basics of Hadoop: Introduction Structure Objectives Data distribution Data format Analyzing data with Hadoop Scale-in versus scale-out Number of reducers used Driver class with no reducer Hadoop streaming Streaming in Ruby Streaming in Python Streaming in Java Hadoop pipes Design of HDFS Very large Streaming data access Commodity hardware Low-latency data access Lots of small files Arbitrary file modifications HDFS concept Blocks Namenodes and DataNodes HDFS group All-time availability Hadoop file system Java interface HTTP APIs in C language Filesystem in Userspace Reading data using the Java interface Reading data using Java interface (FileSystem API) Data flow File read File write Coherency model Cluster balance Hadoop archive Hadoop I/O Data integrity Local file system Compression Codecs Compression and input splits Map output Serialization Avro file-based data structure Data type and schemas Serialization and deserialization Avro MapReduce Conclusion
5. Hadoop Installation: Introduction Structure Objectives Using standalone (local) mode VMware On Ubuntu 16.04 Fully distributed mode Installation and configuration of multi-node cluster Conclusion
6. MapReduce Applications: Introduction Structure Objectives Understanding MapReduce Traditional way MapReduce workflow Map side Reduce side Sample program using MapReduce Introduction of Web UI Debugging MapReduce job Job chaining and job control Anatomy of MapReduce job Anatomy of file write Anatomy of file read MapReduce job run Classic MapReduce: MapReduce 1 Failure in MapReduce 1 MapReduce 2 YARN Failure in MapReduce 2 Conclusion
7. Hadoop Related Tools-I: HBase and Cassandra: Introduction Structure Objectives Installation of HBase Conceptual architecture Regions and region server Master Server Locking Implementation HBase versus RDBMS HBase client Class HTable Class Put Class Get Class Delete Class Result HBase examples and commands HBase using Java APIs Creating a table List of the tables in HBase Disable a table Add column family Deleting column family Verifying the existence of the table Deleting table Disabling table Stopping HBase Challenges Cassandra CAP theorem Explanation in terms of intersection points Characteristics of Cassandra Installing Cassandra Basic CLI commands Cassandra data model Super column family Clusters Keyspaces Column families Super columns Cassandra examples Creating a keyspace Alter keyspace Dropping a keyspace Create table Primary key Alter table Truncate table Executing batch Delete entire row Describe Cassandra client Thrift Avro Hector Hadoop integration Use cases eBay Hulu Conclusion
8. Hadoop Related Tools-II: PigLatin and HiveQL: Introduction Structure Objectives Apache PigLatin Installation Execution type Local mode MapReduce mode The platform for running Pig programs Script Grunt Embedded Grunt Shell Example Commands in grunt Pig data model Scalar Complex PigLatin Input and output Store Relational operations Examples User-defined functions Developing and testing the PigLatin script Dump operator Describe operator Explanation operator Illustration operator Hive Installing Hive Hive architecture Hive services Data type and file format Comparison of HiveQL with traditional database Schema on read versus write Update, transactions and indexes HiveQL Data definition language Data manipulation language Conclusion
9. Practical and Research-based Topics: Introduction Structure Objectives Data analysis with X Using flume Using MapReduce Use of Bloom filter in MapReduce The function of the Bloom filter Working of Bloom filter Application of Bloom filter Implementation of Bloom filter in MapReduce Amazon Web Service Setting up AWS Setting up Hadoop on EC2 Examples of data analysis Document archived from NY Times Data mining in mobiles Hadoop diagnosis System’s health Setting permission Managing quotas Enabling trash Removing DataNode Conclusion
10. Spark: Introduction Structure Objectives Spark programming model Record linkage Spark shell Scala programming model Features of Scala Work on Scala Resilient Distributed Dataset Spark methods for data processing Aggregate Cartesian Checkpoint Repartition Cogroup Collect CollectAsMap CombineByKey Compute Count CountByKey CountByValue countApproxDistinct Dependencies Distinct first filter and filterWith filter Transformation fold foreach getStorageLevel groupBy Histogram id join leftOuterJoin Example of programs using Scala Shuffling Common Spark memory issues Conclusion
Index