Hadoop Big Data Training
This comprehensive Hadoop Big Data course was designed by industry experts around current job requirements and provides in-depth learning on big data and the Hadoop modules. It is an industry-recognized course that combines training for Hadoop developers, Hadoop administrators, Hadoop testing, and analytics. This Cloudera Hadoop training will prepare you for big data certification.
1. Introduction to Big Data & Hadoop, Hadoop Ecosystem, MapReduce and HDFS
Topics – Introduction to Hadoop, Problems with data growth, Solving Data Problems, Hadoop Overview, Understanding MapReduce, Setting the stage for big data problem solving with MapReduce, Parallel Copying with Hadoop distcp, Hadoop fs, Hadoop Archives
2. Introduction to HDFS
Topics – Introduction to Distributed File Systems, What is the Hadoop Distributed File System (HDFS), HDFS Design Principles & Failure, HDFS Architecture: High Availability Mode and Federated Mode, Overall Architecture of HDFS, HDFS Daemons, Basic HDFS Commands, Understanding MapReduce, Hadoop Architecture, Difference between MR1 and MR2, What is YARN, YARN jobs, Resource Management
3. Hadoop Installation & setup
Topics – Hadoop 2.x Cluster Architecture, Federation and High Availability, A Typical Production Hadoop Cluster, Hadoop Cluster Modes, Common Hadoop Shell Commands, Hadoop 2.x Configuration Files, Cloudera Single-Node Cluster
4. Introduction to MapReduce
Topics – What is Hadoop MapReduce and examples, Conceptual Understanding of Map and Reduce, Anatomy of a YARN Application Run, YARN MR Application Execution Flow, YARN Workflow, Writing a MapReduce Program using the Hadoop Framework
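As a conceptual illustration of the map and reduce steps listed above, here is a minimal word count sketch in plain Python. It does not use the Hadoop framework; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are hypothetical and only mimic the roles of the mapper, the shuffle and sort step, and the reducer on a single machine.

```python
from collections import defaultdict

def map_phase(line):
    """Mapper role: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    """Shuffle and sort role: group emitted values by key, ordered by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))

def reduce_phase(key, values):
    """Reducer role: collapse all values for one key into a single count."""
    return key, sum(values)

def word_count(lines):
    """Run the three phases in order on an in-memory list of lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())

if __name__ == "__main__":
    print(word_count(["big data", "Big Hadoop data"]))
```

In a real cluster the mapper and reducer would run as separate tasks on different nodes, and the framework would perform the grouping step between them.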
5. Deep Dive in MapReduce
Topics – What is Functional Programming, Difference between Functional and Imperative Programming, What is Mapping, What is a Reducer, Phases of Map and Reduce, Combiner, Partitioner, Shuffle & Sort Phase, MapReduce job submission flow, MapReduce Types: Input and Output Formats, Custom Formats, Hadoop APIs, Exercise on Input and Output Formats, Task Execution, Hadoop commands, MapReduce Features: Counters, Sorting, Reduce Joins, Side Data Distribution, MapReduce Library Classes, Hadoop Streaming, Aggregate Data, Example of calculating the time a user has spent on an activity
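The closing example in this module, calculating the time a user has spent on an activity, could look roughly like the sketch below. It is plain Python rather than Hadoop code, and it assumes a record layout of (user, activity, start_seconds, end_seconds) that the syllabus does not specify. The per-split pre-aggregation plays the part of a combiner; all function names are hypothetical.

```python
from collections import defaultdict

def map_record(record):
    """Mapper role: emit ((user, activity), duration) for one log record."""
    user, activity, start, end = record
    return (user, activity), end - start

def sum_by_key(pairs):
    """Sum values per key; used here as both combiner and reducer."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

def time_per_activity(splits):
    """Each split is the list of records one mapper would process."""
    combined = []
    for split in splits:
        # Combiner role: pre-aggregate locally so less data is shuffled.
        local_totals = sum_by_key(map_record(record) for record in split)
        combined.extend(local_totals.items())
    # Reducer role: merge the partial sums coming from every split.
    return sum_by_key(combined)

if __name__ == "__main__":
    splits = [
        [("alice", "search", 0, 30), ("alice", "search", 40, 50)],
        [("alice", "search", 100, 120), ("bob", "checkout", 5, 65)],
    ]
    print(time_per_activity(splits))
```

Reusing one summing function for both roles works because addition is associative and commutative, which is the usual condition for a reducer to double as a combiner.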
6. Problem Solving using MapReduce
Topics – MapReduce Problem Statement, Hadoop Mapper, Mapper Problem, How to Handle Multiple Mappers, Multiple Inputs, Working with Multiple Input Formats
7. Deep Dive in Pig
A. Introduction to Pig
Topics – What Is Pig?, Pig’s Features, Pig Use Cases, Interacting with Pig
B. Basic Data Analysis with Pig
Topics – Pig Latin Syntax, Loading Data, Simple Data Types, Field Definitions, Data Output, Viewing the Schema, Filtering and Sorting Data, Commonly-Used Functions, Hands-On Exercise: Using Pig for ETL Processing
C. Processing Complex Data with Pig
Topics – Complex/Nested Data Types, Grouping, Iterating Grouped Data, Hands-On Exercise: Analyzing Data with Pig
D. Multi-Dataset Operations with Pig
Topics – Techniques for Combining Data Sets, Joining Data Sets in Pig, Set Operations, Splitting Data Sets, Hands-On Exercise
E. Extending Pig
Topics – Macros and Imports, UDFs, Using Other Languages to Process Data with Pig, Hands-On Exercise: Extending Pig with Streaming and UDFs
F. Pig Jobs
Topics – Case studies of Fortune 500 companies (Electronic Arts and Walmart) using real data sets
8. Deep Dive in Hive
A. Introduction to Hive
Topics – What Is Hive?, Hive Schema and Data Storage, Comparing Hive to Traditional Databases, Hive vs. Pig, Hive Use Cases, Interacting with Hive
B. Relational Data Analysis with Hive
Topics – Hive Databases and Tables, Basic HiveQL Syntax, Data Types, Joining Data Sets, Common Built-in Functions, Hands-On Exercise: Running Hive Queries on the Shell, Scripts, and Hue
C. Hive Data Management
Topics – Hive Data Formats, Creating Databases, Modeling in Hive and Hive-Managed Tables, Loading Data into Hive, Altering Databases and Tables, Self-Managed Tables, Simplifying Queries with Views, Storing Query Results, Controlling Access to Data, Hands-On Exercise: Data Management with Hive, Thrift Server, Metastore in Hive
D. Hive Optimization
Topics – Understanding Query Performance, Partitioning, Bucketing, Indexing Data
E. Extending Hive
Topics – User-Defined Functions in Hive
F. Hands-On Exercises – Working with large data sets and querying extensively
G. User-Defined Functions, Optimizing Queries, Tips and Tricks for Performance Tuning
9. Data Formats (Avro)
Topics – Selecting a File Format, Hadoop Tool Support for File Formats, Avro Schemas, Using Avro with Hive and Sqoop, Avro Schema Evolution, Compression
10. Introduction to HBase Architecture
Topics – What is HBase, Where does it fit, What is NoSQL
11. Apache Spark
A. Why Spark? Explain Spark and Hadoop Distributed File System
Topics – What is Spark, Comparison with Hadoop, Components of Spark
B. Spark Components, Common Spark Algorithms-Iterative Algorithms, Graph Analysis, Machine Learning
Topics – Apache Spark: Introduction, Consistency, Availability, Partition, Unified Stack Spark, Spark Components, Comparison with Hadoop – Scalding example, Mahout, Storm, Graph
C. Running Spark on a Cluster, Writing Spark Applications using Python, Java, Scala
Topics – Python example, Installing Spark, The driver program, Spark context with an example, Weakly typed variables, Combining Scala and Java seamlessly, Concurrency and distribution, What is a trait, Higher-order functions with an example, Define OFI scheduler, Advantages of Spark, Example of a lambda using Spark, MapReduce with an example
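To make the higher-order function and lambda topics above concrete, here is a small sketch in plain Python. It does not require Spark; it only shows the style of passing functions as arguments, which is also how map and reduce operations are typically expressed in Spark programs. The names compose and apply_to_each are hypothetical helpers written for this illustration.

```python
from functools import reduce

def compose(*functions):
    """Higher-order function: returns a new function that applies
    the given functions to a value from left to right."""
    return lambda value: reduce(lambda acc, fn: fn(acc), functions, value)

def apply_to_each(fn, items):
    """Higher-order function: takes a function and applies it to every item."""
    return [fn(item) for item in items]

# Lambdas are anonymous functions passed directly as arguments.
square_then_increment = compose(lambda x: x * x, lambda x: x + 1)

results = apply_to_each(square_then_increment, [1, 2, 3])
total = reduce(lambda a, b: a + b, results, 0)

if __name__ == "__main__":
    print(results, total)
```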
12. Major Project – Putting it all together and Connecting Dots
Topics – Putting it all together and Connecting Dots, Working with Large data sets, Steps involved in analyzing large data
13. ETL Connectivity with Hadoop Ecosystem
Topics – How ETL tools work in the big data industry, Connecting to HDFS from an ETL tool and moving data from the local system to HDFS, Moving data from a DBMS to HDFS, Working with Hive from an ETL tool, Creating a MapReduce job in an ETL tool, End-to-end ETL PoC showing Hadoop integration with an ETL tool
14. Hadoop Cluster Configuration
Topics – Hadoop configuration overview and important configuration files, Configuration parameters and values, HDFS parameters, MapReduce parameters, Hadoop environment setup, ‘Include’ and ‘Exclude’ configuration files
15. Hadoop Administration and Maintenance
Topics – Namenode/Datanode directory structures and files, File system image and Edit log, The Checkpoint Procedure, Namenode failure and recovery procedure, Safe Mode, Metadata and Data backup, Potential problems and solutions / what to look for, Adding and removing nodes, Lab: MapReduce File system Recovery
16. Hadoop Monitoring and Troubleshooting
Topics – Best practices for monitoring a Hadoop cluster, Using logs and stack traces for monitoring and troubleshooting, Using open-source tools to monitor a Hadoop cluster
17. ZooKeeper
Topics – ZooKeeper Introduction, ZooKeeper use cases, ZooKeeper Services, ZooKeeper Data Model, Znodes and their types, Znode operations, Znode watches, Znode reads and writes, Consistency Guarantees, Cluster management, Leader Election, Distributed Exclusive Lock, Important points
18. Advanced Oozie
Topics – Why Oozie?, Installing Oozie, Running an example, Oozie workflow engine, Example M/R action, Word count example, Workflow application, Workflow submission, Workflow state transitions, Oozie job processing, Oozie Hadoop security, Why Oozie security?, Job submission to Hadoop, Multi-tenancy and scalability, Timeline of an Oozie job, Coordinator, Bundle, Layers of abstraction, Architecture, Use Case 1: time triggers, Use Case 2: data and time triggers, Use Case 3: rolling window
19. Advanced Flume
Topics – Overview of Apache Flume, Flume for Hadoop, Physically distributed data sources, Changing structure of data, Closer look, Anatomy of Flume, Core concepts, Event, Clients, Agents, Source, Channels, Sinks, Interceptors, Channel selector, Sink processor, Data ingest, Agent pipeline, Transactional data exchange, Routing and replicating, Why channels?, Use case: Log aggregation, Adding a Flume agent, Handling a server farm, Data volume per agent, Example describing a single-node Flume deployment
20. Hadoop Stack Integration Testing
Topics – Why Hadoop testing is important, Unit testing, Integration testing, Performance testing, Diagnostics, Nightly QA test, Benchmark and end to end tests, Functional testing, Release certification testing, Security testing, Scalability Testing, Commissioning and Decommissioning of Data Nodes Testing, Reliability testing, Release testing
21. Roles and Responsibilities of Hadoop Testing
Topics – Understanding the requirements, Preparing the testing estimate, Test Cases, Test Data, Test bed creation, Test Execution, Defect Reporting, Defect Retest, Daily status report delivery, Test completion, ETL testing at every stage (HDFS, Hive, HBase) while loading the input (logs, files, records, etc.) using Sqoop/Flume, including but not limited to data verification, Reconciliation, User Authorization and Authentication testing (Groups, Users, Privileges, etc.), Reporting defects to the development team or manager and driving them to closure, Consolidating all defects and creating defect reports, Validating new features and issues in Core Hadoop